Answering questions by means of Natural Language Processing

Saturday 28 April 2012 13:42

Over the past year I have been working on a framework that lets users interact with a knowledge base through written natural language. The goal is to provide a simple library that allows you to ask a question in plain English (or some other language) and receive a complete sentence as an answer. I do this in the scarce hours of free time that I have, so the project is a long way from being finished. However, I have now reached the point where the program can make a full round trip from question to answer for a single question. This is a good moment to write about the choices I made and to explain the process.

There are many ways to process a sentence. I am going to describe the one I chose. I am still in the middle of all this, so don't expect a finished product. I wrote this article to explain the big picture.

In this article I will take the following question as an example.

How many children had Lord Byron?

Follow the article to find out the answer :)

Understanding

Extract words

In the first step the words of the sentence are extracted. I have not found any use for punctuation marks at present so we can discard them. Whether a sentence is a question or a statement will be deduced from the structure of the sentence.

The product of this step is this:

[How, many, children, had, Lord, Byron]
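As a rough illustration of this step, here is a minimal sketch in PHP. It is not the framework's actual code: the function name extractWords and the regular expression are assumptions made for this example.

```php
<?php
// Hypothetical helper: split a sentence into words and drop punctuation marks.
function extractWords(string $sentence): array
{
    // \p{L} matches any letter, \p{N} any digit.
    // An apostrophe or hyphen is kept only when it sits inside a word.
    preg_match_all('/[\p{L}\p{N}]+(?:[\'-][\p{L}\p{N}]+)*/u', $sentence, $matches);

    return $matches[0];
}

print_r(extractWords('How many children had Lord Byron?'));
```

The question mark is simply dropped, in line with the decision to deduce the sentence type from the structure of the sentence.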

Group lexical items

In the next step, words that go together are aggregated into lexical items. The goal of this step is to find the smallest possible groups of words that still have a single part of speech (noun, verb, determiner, preposition, etc.). This way, compounds like "barbed wire" are considered to be a single item. I also use this step to find proper nouns (such as the names of people, like "Lord Byron"). The words that are found in the lexicon are lowercased; the words that are suspected to be proper nouns are left unchanged (since there is no point in lowercasing them, and I would need to uppercase them again in the production phase). This step produces this array of words:

[how, many, children, had, Lord Byron]
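One way to sketch this grouping is a greedy longest-match lookup against the lexicon. The lexicon contents, the maximum item length and the function name below are assumptions for illustration. The sketch also simplifies by treating every run of words that is not in the lexicon as one proper noun.

```php
<?php
// Toy lexicon (an assumption): lowercased entries that may span several words.
const LEXICON = ['how', 'many', 'children', 'had', 'barbed wire'];
const MAX_ITEM_LENGTH = 3; // longest lexicon entry we try, counted in words

function groupLexicalItems(array $words): array
{
    $items = [];
    $properNoun = []; // unknown words collected so far, case preserved
    $i = 0;
    $n = count($words);
    while ($i < $n) {
        $matched = false;
        // Try the longest candidate first, so "barbed wire" is found as one item.
        for ($len = min(MAX_ITEM_LENGTH, $n - $i); $len >= 1; $len--) {
            $candidate = strtolower(implode(' ', array_slice($words, $i, $len)));
            if (in_array($candidate, LEXICON, true)) {
                if ($properNoun) {
                    // A known word ends the proper noun we were collecting.
                    $items[] = implode(' ', $properNoun);
                    $properNoun = [];
                }
                $items[] = $candidate;
                $i += $len;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            // Not in the lexicon: assume it belongs to a proper noun.
            $properNoun[] = $words[$i];
            $i++;
        }
    }
    if ($properNoun) {
        $items[] = implode(' ', $properNoun);
    }

    return $items;
}

print_r(groupLexicalItems(['How', 'many', 'children', 'had', 'Lord', 'Byron']));
```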

Parse the words to form a phrase specification

This is a big step. It is based on section 11.5 of the book "Speech and language processing" (Jurafsky & Martin) which describes parsing with unification constraints. In this step a syntax tree is created based on the array of words we just found, and a set of production rules like these:

S => WhNP VP NP             (how many children had Lord Byron)
WhNP => whwordNP NP         (how many children)
NP => determiner noun       (many children)
VP => verb                  (had)
NP => propernoun            (Lord Byron)

The parsing grammar has more rules, but these are the ones needed to parse our sentence.

I made four choices at this point. First, I decided to go for Phrase Structure Grammar rather than some other grammar formalism, mainly because the books I happened to read treated this type of grammar much more extensively. Other grammars may work equally well, or even better. Second, I chose the Earley parser, because it was said to have the best time efficiency. I have not compared it to other parsers, and those might be better in some instances.
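To make the role of the production rules concrete, here is a small sketch that builds a syntax tree for the example sentence. Note that it uses a naive top-down backtracking parser, not the Earley parser, and that it ignores feature structures entirely. The part-of-speech lexicon and all function names are assumptions for this example.

```php
<?php
// The five production rules shown above, keyed by the category they produce.
$rules = [
    'S'    => [['WhNP', 'VP', 'NP']],
    'WhNP' => [['whwordNP', 'NP']],
    'NP'   => [['determiner', 'noun'], ['propernoun']],
    'VP'   => [['verb']],
];
// Toy part-of-speech lexicon (an assumption, not the project's lexicon).
$lexicon = [
    'how' => 'whwordNP', 'many' => 'determiner', 'children' => 'noun',
    'had' => 'verb', 'Lord Byron' => 'propernoun',
];

// Returns every way to parse $cat starting at $pos, as [tree, nextPos] pairs.
function parseCategory(string $cat, array $items, int $pos, array $rules, array $lexicon): array
{
    if (!array_key_exists($cat, $rules)) {
        // Part of speech: matches exactly one lexical item of that type.
        $item = $items[$pos] ?? null;
        if ($item !== null && ($lexicon[$item] ?? null) === $cat) {
            return [[[$cat, $item], $pos + 1]];
        }
        return [];
    }
    $results = [];
    foreach ($rules[$cat] as $rhs) {
        // Partial parses of this right-hand side: [children so far, position].
        $partials = [[[], $pos]];
        foreach ($rhs as $childCat) {
            $next = [];
            foreach ($partials as [$children, $p]) {
                foreach (parseCategory($childCat, $items, $p, $rules, $lexicon) as [$tree, $q]) {
                    $next[] = [array_merge($children, [$tree]), $q];
                }
            }
            $partials = $next;
        }
        foreach ($partials as [$children, $p]) {
            $results[] = [array_merge([$cat], $children), $p];
        }
    }
    return $results;
}

// Returns the first tree that covers the whole sentence, or null if there is none.
function parseSentence(array $items, array $rules, array $lexicon): ?array
{
    foreach (parseCategory('S', $items, 0, $rules, $lexicon) as [$tree, $end]) {
        if ($end === count($items)) {
            return $tree;
        }
    }
    return null;
}

// Renders a tree as a bracketed string.
function bracket(array $tree): string
{
    $parts = [];
    foreach ($tree as $node) {
        $parts[] = is_array($node) ? bracket($node) : $node;
    }
    return '[' . implode(' ', $parts) . ']';
}

$tree = parseSentence(['how', 'many', 'children', 'had', 'Lord Byron'], $rules, $lexicon);
echo bracket($tree), "\n";
```

This naive approach would loop forever on left-recursive rules such as NP => NP PP, which is one of the cases a chart parser like Earley's handles properly.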

The resulting syntax tree represents only the syntactic aspect of the parse. It does not show the features that are attached to each of its nodes. These features provide extra syntactic as well as semantic information. While the sentence is parsed into a parse tree, the feature structures that are bound to each of the words in the sentence are propagated up the tree via unification with the feature structures of the production rules.

The third choice I made was to integrate semantic parsing with syntactic parsing. I did this because I figured that producing multiple resulting parse trees was not really very useful. It is a neat feat, but it is not important for my project. One would still need to choose one tree over the other in order to answer the question in the best possible way. So why not try to find the best tree immediately, and leave it at that? Adding semantic constraints to the parse process helps eliminate ambiguous trees, and I needed semantics anyway. So my strategy is to add as many constraints to the parse as I can, and then pick the first parse tree that comes out.

I would like to explain to you the incredible power of feature structures in a single paragraph, but I'm afraid that's impossible. See chapter 11 of Jurafsky & Martin for an extensive explanation. I added feature structures to each of the production rules, like this one:

array(
    array('cat' => 'S',
        'features' => array('head-1' => array('sentenceType' => 'wh-non-subject-question', 'voice' => 'active'))),
    array('cat' => 'WhNP',
        'features' => array('head' => array('sem-1' => null))),
    array('cat' => 'VP',
        'features' => array('head-1' => array('agreement-1' => null, 'sem-1' => array('arg1{sem-2}' => null)))),
    array('cat' => 'NP',
        'features' => array('head' => array('agreement-1' => null, 'sem-2' => null))),
),

You will recognize the production rule (S => WhNP VP NP) when you look at the 'cat' attributes. The array above can be read as a set of linked feature structures: features that carry the same numbered suffix (-1, -2) share a single value.

It really is as complicated as it looks. Feature structures are not simple. They form a different programming paradigm and take some time to learn. In the feature structure above, S shares all features of VP via the head node. This means that the VP is the base node of the structure and the S inherits all of it, including its sem (semantics). You can also see that NP and VP (via S) share agreement. That is, NP and VP should have the same person and number. This makes it impossible for the sentence "How many children has we?" to pass. Finally, you can see that NP forms the first argument (arg1) of the predicate expressed by VP.

Now the feature structures of the constituents WhNP, VP, and NP are unified with the feature structure of the top-level production rule above, and at the same time the sentence's semantics bubble up from bottom to top.
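The core idea of unification can be sketched in a few lines. The version below is a deliberately simplified assumption: it treats feature structures as plain nested arrays, uses null for "not yet specified", and leaves out shared values (the numbered suffixes), which a real implementation needs. The function name unify and the plural marker 'p' are made up for this example.

```php
<?php
// Unifies two feature structures given as nested arrays.
// A null value means "not yet specified". Returns false when the structures clash.
function unify($a, $b)
{
    if ($a === null) {
        return $b;
    }
    if ($b === null) {
        return $a;
    }
    if (is_array($a) && is_array($b)) {
        $result = $a;
        foreach ($b as $feature => $value) {
            if (!array_key_exists($feature, $result)) {
                // Only one side has this feature: just take it over.
                $result[$feature] = $value;
                continue;
            }
            $unified = unify($result[$feature], $value);
            if ($unified === false) {
                return false;
            }
            $result[$feature] = $unified;
        }
        return $result;
    }
    // Atomic values unify only when they are identical.
    return $a === $b ? $a : false;
}

$has   = ['agreement' => ['person' => 3, 'number' => 's']];
$we    = ['agreement' => ['person' => 1, 'number' => 'p']];
$byron = ['agreement' => ['person' => 3, 'number' => 's'], 'sem' => ['name' => 'Lord Byron']];

var_dump(unify($has, $we));   // the agreement features clash, so this fails
print_r(unify($has, $byron)); // agreement matches and sem is added
```

The first call fails because person and number differ, which is the mechanism that rejects "How many children has we?".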

Semantics starts at the lowest level, where each word in the language has a feature structure attached to it. Here's an example for the verb 'had' that is used in our example sentence:

'had' => array(
    'verb' => array(
        'features' => array(
            'head' => array(
                'tense' => 'past',
                'sem' => array('predicate' => '*have'),
            ),
        ),
    ),
)

Parsing the sentence creates the following unified feature structure which we will call a phrase specification.

Array
(
    [head] => Array
        (
            [sem] => Array
                (
                    [predicate] => *have
                    [arg1] => Array
                        (
                            [name] => Lord Byron
                        )
                    [arg2] => Array
                        (
                            [category] => *child
                            [determiner] => *many
                            [question] => *extent
                        )
                )

            [tense] => past
            [sentenceType] => wh-non-subject-question
            [voice] => active
            [agreement] => Array
                (
                    [number] => s
                    [person] => 3
                )
        )
)

In this specification you see the sem (semantics) part that describes the predicate and its arguments. All three of them are objects that have attributes. The arg2 object contains an unknown, a variable, that is the root of the fact that this sentence is, in fact, a question. Next to these semantic aspects there are syntactic aspects, like tense and voice.

This allows me to explain the fourth choice I made: I don't create a logical representation of the sentence, but a syntactic-semantic representation, which is called a phrase specification. This type of structure is used more often in language generation than in language understanding. I found the best descriptions of it in the book "Building natural language generation systems" (Reiter & Dale). This structure, which is a combination of syntactic and semantic features, keeps the structure of the sentence intact and saves information about tense, voice, etc. This is all done to make it easier to create the answer, as we will see later on. But first we need to find the actual answer.

Finding the answer

Turn the phrase specification into a knowledge base representation

Now that we have the question, we need to find the answer. Luckily, these days you can find many Open Data knowledge bases on the internet that have publicly accessible, and even free, interfaces! I used one of them, which I will not disclose at this point. It has a SPARQL interface. So I wrote a function that turns the phrase specification above into a SPARQL query:

SELECT COUNT(?id151) WHERE {
    ?id193 <http://some_enormous_database.net/object/child> ?id151 .
    ?id193 rdfs:label 'Lord Byron'@en
}

This knowledge base returns a data structure that tells me that the answer is:

2

The children were called Ada (Ada Byron, later Ada Lovelace) and Allegra Byron [1].

Would it have been easier to turn our question into a SPARQL query if the representation had been a set of logical propositions? Probably not. You see, there are many ways to represent semantic information. For a simple example, the relation child(x, y) could have been represented as father(y, x). Other representations are even more different. So there will always be a conversion step between our way of representing relations and the way the knowledge base represents relations, except in the case that you write your parser uniquely for this specific knowledge base. I wanted to write a more generic framework that allows the use of different knowledge bases.
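The conversion step can be sketched as a function that takes the phrase specification plus a mapping from semantic categories to knowledge base predicates. This is only an illustration, not the framework's code: the function name toSparql, the mapping and the variable names ?subject and ?object are assumptions, and the URI is the placeholder used in the query above. It also assumes the endpoint already knows the rdfs: prefix.

```php
<?php
// Builds a SPARQL query from a phrase specification (illustrative sketch only).
// $predicates maps semantic categories to knowledge base predicate URIs (hypothetical).
function toSparql(array $phraseSpec, array $predicates): string
{
    $sem = $phraseSpec['head']['sem'];
    $predicateUri = $predicates[$sem['arg2']['category']];
    // Escape backslashes and single quotes so the name stays a valid string literal.
    $name = addcslashes($sem['arg1']['name'], "\\'");
    // An "*extent" question asks "how many", so the matches are counted.
    $isCount = ($sem['arg2']['question'] ?? null) === '*extent';
    $select = $isCount ? 'COUNT(?object)' : '?object';

    return "SELECT $select WHERE {\n"
        . "    ?subject <$predicateUri> ?object .\n"
        . "    ?subject rdfs:label '$name'@en\n"
        . "}";
}

$question = ['head' => ['sem' => [
    'predicate' => '*have',
    'arg1' => ['name' => 'Lord Byron'],
    'arg2' => ['category' => '*child', 'determiner' => '*many', 'question' => '*extent'],
]]];

echo toSparql($question, ['*child' => 'http://some_enormous_database.net/object/child']), "\n";
```

Swapping in another knowledge base then comes down to supplying a different mapping, and possibly a different query template.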

Generation

Integrate the answer into the phrase specification of the question

I would like to have the language processor give me the answer in a full sentence. We start by filling the answer that the knowledge base gave us into the phrase specification:

Array
(
    [head] => Array
        (
            [sem] => Array
                (
                    [predicate] => *have
                    [arg1] => Array
                        (
                            [name] => Lord Byron
                        )
                    [arg2] => Array
                        (
                            [category] => *child
                            [determiner] => 2
                        )
                )

            [tense] => past
            [sentenceType] => declarative
            [voice] => active
        )
)

We modify the existing phrase specification and fill in the variable. Then we change the sentenceType from question to declarative and we're done. This is important. We don't need to think of a sentence structure to express the knowledge we have. It's already there. It's like performing jiu-jitsu on the question: use its own strength against it.
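In code, this transformation could look roughly like the sketch below. The function name integrateAnswer is hypothetical. Dropping the agreement feature is an assumption, based only on the fact that the answer specification shown above no longer contains it.

```php
<?php
// Turns the phrase specification of a question into that of its answer (sketch).
function integrateAnswer(array $phraseSpec, $answer): array
{
    // PHP arrays are copied on assignment, so the caller's question stays untouched.
    $arg2 = $phraseSpec['head']['sem']['arg2'];
    unset($arg2['question']);      // the unknown is no longer unknown
    $arg2['determiner'] = $answer; // "*many" is replaced by the actual value
    $phraseSpec['head']['sem']['arg2'] = $arg2;
    $phraseSpec['head']['sentenceType'] = 'declarative';
    // Assumption: agreement is dropped here and determined again during generation.
    unset($phraseSpec['head']['agreement']);

    return $phraseSpec;
}

$question = ['head' => [
    'sem' => [
        'predicate' => '*have',
        'arg1' => ['name' => 'Lord Byron'],
        'arg2' => ['category' => '*child', 'determiner' => '*many', 'question' => '*extent'],
    ],
    'tense' => 'past',
    'sentenceType' => 'wh-non-subject-question',
    'voice' => 'active',
    'agreement' => ['number' => 's', 'person' => 3],
]];

print_r(integrateAnswer($question, 2));
```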

Generate a sequence of lexical items from this phrase specification

Now we need to go the opposite way. We need to create a string of words from a tree structure. That seems easy. Just use the same production rules in the reverse order. Unfortunately it is not that simple. The production rules can be used, but the feature structures need to change. I found two reasons for this:

Generation is not a search process. If you don't specify which rules to use for generation, the processor will try many possible paths that will fail. This is a big waste of time!
Many feature structures used for understanding just pass meaning up to the node above. If this process is reversed the meaning stays the same, and this will cause an infinite loop for recursive rules like NP => NP PP.

The essential difference between understanding and generation is:

"Understanding is about hypotheses management, while generation is about choice."

(freely adapted from Speech and Language Processing, p. 766)

Generation is a process of hierarchical choices: each choice is based on a part of the phrase specification. Let's start at the top. I will show you the top-level production rule I use to generate the answer to the question:

array(
    'condition' => array('head' => array('sentenceType' => 'declarative', 'voice' => 'active')),
    'rule' => array(
        array('cat' => 'S', 'features' => array('head' => array('tense-1' => null,
            'sem' => array('predicate-1' => null, 'arg1-1' => null, 'arg2-1' => null)))),
        array('cat' => 'NP', 'features' => array('head' => array('agreement-2' => null, 'tense-1' => null,
            'sem{arg1-1}' => null))),
        array('cat' => 'VP', 'features' => array('head' => array('agreement-2' => null, 'tense-1' => null,
            'sem' => array('predicate-1' => null)))),
        array('cat' => 'NP', 'features' => array('head' => array('sem{arg2-1}' => null))),
    ),
),

The structure has two parts: a condition and a rule. The rule is used to generate part of the syntax tree only if the condition matches when it is unified with the partial phrase specification. The feature structure is different from the rule we saw earlier: it makes sure that meaning is distributed over the consequents. Generation is a recursive process, like understanding. First the S node is matched to the rule I just mentioned and the top-level phrase specification is unified with the feature structure of the rule. Then the process is repeated for the NP, VP and NP, and so on, until the lexical items are found.

While the understanding rules are ordered by frequency of occurrence, generation rules are ordered by decreasing specificity: the rules at the top have more conditions.

The result of this process is an array of lexical items:

[Lord Byron, had, 2, children]
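The choice mechanism described above (a condition per rule, with the most specific rules tried first) can be sketched as follows. This only shows how a rule is selected, not the full generation process. The condition check is a plain subset match rather than real unification, and the rule names, including the fallback rule, are hypothetical.

```php
<?php
// True when every feature in $condition occurs in $spec with the same value.
function matches(array $condition, array $spec): bool
{
    foreach ($condition as $feature => $expected) {
        if (!array_key_exists($feature, $spec)) {
            return false;
        }
        $actual = $spec[$feature];
        if (is_array($expected)) {
            if (!is_array($actual) || !matches($expected, $actual)) {
                return false;
            }
        } elseif ($actual !== $expected) {
            return false;
        }
    }
    return true;
}

// Counts the leaf features of a condition; more leaves means more specific.
function specificity(array $condition): int
{
    $count = 0;
    foreach ($condition as $value) {
        $count += is_array($value) ? specificity($value) : 1;
    }
    return $count;
}

// Returns the name of the most specific rule whose condition matches, or null.
function chooseRule(array $rules, array $spec): ?string
{
    // Sort by decreasing specificity, keeping the rule names as keys.
    uasort($rules, function (array $a, array $b): int {
        return specificity($b['condition']) <=> specificity($a['condition']);
    });
    foreach ($rules as $name => $rule) {
        if (matches($rule['condition'], $spec)) {
            return $name;
        }
    }
    return null;
}

$rules = [
    'declarative-fallback' => ['condition' => ['head' => ['sentenceType' => 'declarative']]],
    'declarative-active' => ['condition' => ['head' => ['sentenceType' => 'declarative', 'voice' => 'active']]],
];
$answer = ['head' => ['tense' => 'past', 'sentenceType' => 'declarative', 'voice' => 'active']];

echo chooseRule($rules, $answer), "\n";
```

Because the first matching rule wins, no search over alternative rules is needed, which fits the idea that generation is about choice.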

Punctuation and capitalization

To form a proper sentence, it should start with a capital letter (which it already had in our example) and end with a period:

Lord Byron had 2 children.
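This last step is small enough to sketch in full. The function name realizeSentence is an assumption, as is the rule that any sentenceType ending in "question" gets a question mark.

```php
<?php
// Joins lexical items into a sentence with a capital letter and closing punctuation.
function realizeSentence(array $lexicalItems, string $sentenceType = 'declarative'): string
{
    $sentence = implode(' ', $lexicalItems);
    // Assumption: every sentence type that ends in "question" gets a question mark.
    $terminator = substr($sentenceType, -8) === 'question' ? '?' : '.';

    // ucfirst() only handles a single-byte first letter; use the mb_* functions otherwise.
    return ucfirst($sentence) . $terminator;
}

echo realizeSentence(['Lord Byron', 'had', 2, 'children']), "\n";
```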

That's what we wanted to hear :)

Books

I found the material above in these books. They suggest many alternative routes to follow, none of which is worked out in great detail. After reading them you will, however, have some idea of where to go.

Speech and Language Processing, Daniel Jurafsky & James H. Martin (2000)
Building Natural Language Generation Systems, Ehud Reiter & Robert Dale (2000)

Labels: nlp

Comments on 'Answering questions by means of Natural Language Processing'

Pim	Posted on: 28-04-2012 15:39
	This is really cool, Patrick! If I may ask, what are you planning to do with the framework? Are you going to publish it somewhere, or put a knowledge base search engine online, or...?
Patrick van Bergen	Posted on: 28-04-2012 17:25
	Thank you, Pim! Once the architecture is stable and I am convinced that I have made the right choices, it will become an open source project and I will invite others to contribute. The goal is indeed to give various knowledge bases an NLP interface.