Implementation of an Earley parser with unification

Main content

Implementation of an Earley parser with unification

maandag 05 december 2011 20:50

I just completed a PHP implementation of the Earley parser, extended with unification, that is described in the book "Speech and Language Processing" by Jurafsky and Martin. I thought it might be a good idea to share my code since it took me quite some time and effort to create it and, what's more important, I could not find any other implementation of it on the internet.

The Earley algorithm is a top-down chart parser that parses natural language and does this very efficiently. Given a grammar that contains syntax rules and the words of a language, it parses a sentence into all possible syntax trees. The unification extension adds further restrictions to the syntax rules. An example of such a restriction is that the rule S => NP VP only applies if the number and person of the NP agrees with that of the VP. When NP is "The children", then the VP may be "are playing", but cannot be "is playing" (a restriction on number) or "am playing" (a restriction on person). The grammar should provide extra feature structures for the syntax rules. The unification extension of the Earley parser applies these structures and ensures that a sentence with a feature error will not be parsed.

I just provide the code to give you some idea and hints on to implement it. The code has not been tested very much since I just finished it. I place it under MIT license.

Earley parser source code

The algorithm may be found in the book "Speech and language Processing" (first edition), chapter 11, page 431. In the second edition of the book the algorithm may be found in chapter 15. The rest of this article presumes you have read it, since I cannot quickly repeat all the information given in chapter 11.

The book first describes the basic Earley algorithm, which can be found on Wikipedia. Some implementations are available there as well. It is described in chapter 10 of the book but can be hard to really understand it, because of the undescriptive use of variable names like A and B, and that it uses two functions that are not specified, NEXT-CAT and PARTS-OF-SPEECH.

Syntax rules and feature structures

For the basic algorithm I use syntax rules like the following S => NP VP

array(
    array('cat' => 'S'),
    array('cat' => 'NP'),
    array('cat' => 'VP'),
),

The same rule, extended with feature structures looks like this:

array(
    array('cat' => 'S', 'features' => array('head-1' => null)),
    array('cat' => 'NP', 'features' => array('head' => array('agreement-2' => null))),
    array('cat' => 'VP', 'features' => array('head-1' => array('agreement-2' => null))),
),

This implements the feature structure that is described on page 428 of the book:

Features of the implementation

The implementation of the algorithm handles unification for syntax rules like this,

VP => NP NP

as described on the bottom of page 406. The rule as two identical consequents. If the grammatical categories of the rule are implemented as keys in a hashtable (as they are in my code) then they need to be discerned. I do that by adding the position of the category in the rule.

The book says: VP => NP(1) NP(2)

in the implementation it looks like this: VP@0 => NP@1 NP@2 (but you don't need to think about this when you write the syntax rules)

It also deals with rules like this NP => NP PP where one consequent is identical to the antecedent.

Remarks on the algorithm

The meaning of the function "FOLLOW-PATH" is not described in the book, but I presume that it means that it returns a subset of its argument. It starts out with the label "cat" and follows it to the end.

The function "UNIFY-STATES" of the original algorithm performed a "FOLLOW-PATH" of "dag2". I think this is an error in the algorithm, because it would corrupt the feature set of the charted state.

My version of unifyStates is also somewhat different since I take into account that a rule may have the same category at different positions (as in the VP => NP NP example).

Labeled DAGs

Unifying DAGs is a very elegant to implement unification. I could not follow the explanation of the unification process as described on page 423, unfortunately. That's why I decided to create a different implementation of this function. I introduced a class to separate the logic of the datastructure: LabeledDAG, a labeled directed acyclic graph (or DAG). Which reminds me to thank my employer at Procurios, Bert Slagter, for inadvertently providing me with a perfect datastructure for this task.

Finally

I also added the possibility to stop parsing when the first complete sentence is found. For some applications you are only interested in a single interpretation; it might as well be the first one, especially if you succeed to order your syntax rules in such a way that the best interpretation is found first.

Labels: nlp

« Terug

Reacties op 'Implementation of an Earley parser with unification'

1 2		Laatste pagina
Jerry	Geplaatst op: 20-10-2012 10:25	Quote
	Hi, Nice implementation of the Earley algorithm. I just started playing with it and I am having some trouble building an actual Grammer class. Do you have an example available?
Patrick van Bergen	Geplaatst op: 20-10-2012 20:55	Quote
	Hello Jerry, Good to hear you are interested in the code! I took me some time to get a small example grammar and small test application ready. But there it is, in the new zip-file.
Jerry	Geplaatst op: 21-10-2012 09:51	Quote
	Thanks Patrick, I appreciate your quick response! I'll start experimenting with your example grammar.
toms shoes	Geplaatst op: 26-06-2013 15:41	Quote
	Hey would you mind sharing which blog platform you're using? I'm going to start my own blog in the near future but I'm having a tough time deciding between BlogEngine/Wordpress/B2evolution and Drupal. The reason I ask is because your design seems different then most blogs and I'm looking for something unique. <a href="http://www.ideavelopers.com/tomsshoes.php">www.ideavelopers.com/tomsshoes.php</a> P.S Apologies for getting off-topic but I had to ask!\| toms shoes http://www.ideavelopers.com/tomsshoes.php
toms outlet	Geplaatst op: 02-07-2013 12:44	Quote
	Thank you for the auspicious writeup. It in fact was a amusement account it. Look advanced to far added agreeable from you! However, <a href="http://www.ideavelopers.com/tomsshoes.php">toms outlet</a> how can we communicate?\| toms outlet
michael kors	Geplaatst op: 10-07-2013 08:54	Quote
	Excellent site you have got here.. It's difficult to find excellent writing like yours nowadays. I really appreciate individuals like you! Take care!!\| michael kors
michael kors outlet	Geplaatst op: 10-07-2013 08:54	Quote
	I got this web page from my pal who shared with me on the topic of this website and now this time I am visiting this web page and reading very informative content at this time.\| michael kors outlet http://www.ce411.com/michaelkorsoutlet.php
Yaron	Geplaatst op: 11-07-2013 08:14	Quote
	There are a few cases where you'd want to leave that code in. * It's temporarily dilbased, but intended to be brought back. Having deleted code in your history is not going to remind you* It's there as a warning example. I.e. we tried this, but it doesn't work because... (You can't always get away w/ just describing the algorithm)The golden rule here is, if you have to do that, also leave a comment WHY you have that code there. If there's no comment, I habitually delete commented out code. If you can't tell me why it's there, it doesn't belong.
Yaron	Geplaatst op: 11-07-2013 08:14	Quote
	There are a few cases where you'd want to leave that code in. * It's temporarily dilbased, but intended to be brought back. Having deleted code in your history is not going to remind you* It's there as a warning example. I.e. we tried this, but it doesn't work because... (You can't always get away w/ just describing the algorithm)The golden rule here is, if you have to do that, also leave a comment WHY you have that code there. If there's no comment, I habitually delete commented out code. If you can't tell me why it's there, it doesn't belong.
Yaron	Geplaatst op: 11-07-2013 08:14	Quote
	There are a few cases where you'd want to leave that code in. * It's temporarily dilbased, but intended to be brought back. Having deleted code in your history is not going to remind you* It's there as a warning example. I.e. we tried this, but it doesn't work because... (You can't always get away w/ just describing the algorithm)The golden rule here is, if you have to do that, also leave a comment WHY you have that code there. If there's no comment, I habitually delete commented out code. If you can't tell me why it's there, it doesn't belong.
Samsam	Geplaatst op: 12-07-2013 23:29	Quote
	I think I came up with basically the same<a href="http://jljwtilyvv.com"> aigrolthm</a> as hasan, but it doesn't require preprocessing for range max queries, and maybe it will be simpler to understand (or not).The idea is to maintain subarrays which are multipermutations, not necessarily maximal - which I'll call "green" subarrays - and subarrays surrounding each green one which must be contained in any larger multipermutation - I'll call these "yellow" subarrays. Keep growing these (occasionally turning yellow subarrays green when there is an actual multipermutation) until the whole array is covered either yellow or green, and then the largest green one is the answer.Call the array A with entries A[1]..A[n].I'll first present the case where there is only one "1" in the array. Note that any multipermutation must contain this "1", and also that it is itself a length-1 multipermutation. Say this "1" occurs at index k in the array. Then, for i=2..n, store the following:-L, the largest integer j<k such that A[j]=i-LM := max(A[L..k])-R, the smallest integer j>k such that A[j]=i-RM := max(A[k..R])Clearly, these four arrays can be computed in linear time in a single pass left and right from index k.We will maintain a single "green" subarray, which is itself a subarray of a single "yellow" subarray. Initially, these are both just the single "1" at A[k]. Also maintain:-The largest integer in the yellow subarray, initially 1; call it a.-The smallest integer not present in the yellow subarray, initially 2; call it bNow repeat the following steps:-(Grow yellow subarray) Compare LM and RM. If LM is smaller, then grow the yellow subarray to the left to index L - which must not already be in the yellow subarray by the definition of b. Update a and b as each new element is added (updating b in constant time can be done with some more bookkeeping). If RM is smaller, do the opposite. If LM=RM, then grow in both directions.-(Check for multipermutation) If b>a, change the whole yellow subarray to green and expand in both directions to include all adjacent integers a.This process terminates after at most n steps (any number larger than n can't be part of a multipermutation), and the amortized cost over all steps is O(n), since the yellow array only grows. At the end, the green array is the largest multipermutation.For the more general case (more than one "1" in the array), we start initially with each "1" being its own green (and yellow) array, and grow them all simultaneously, merging the yellow and then green arrays when they meet. Constructing the arrays becomes a bit trickier to do in linear time, but is possible by using the fact that any multipermutation on a sub-interval of length m can only contain integers m.
Samsam	Geplaatst op: 12-07-2013 23:29	Quote
	I think I came up with basically the same<a href="http://jljwtilyvv.com"> aigrolthm</a> as hasan, but it doesn't require preprocessing for range max queries, and maybe it will be simpler to understand (or not).The idea is to maintain subarrays which are multipermutations, not necessarily maximal - which I'll call "green" subarrays - and subarrays surrounding each green one which must be contained in any larger multipermutation - I'll call these "yellow" subarrays. Keep growing these (occasionally turning yellow subarrays green when there is an actual multipermutation) until the whole array is covered either yellow or green, and then the largest green one is the answer.Call the array A with entries A[1]..A[n].I'll first present the case where there is only one "1" in the array. Note that any multipermutation must contain this "1", and also that it is itself a length-1 multipermutation. Say this "1" occurs at index k in the array. Then, for i=2..n, store the following:-L, the largest integer j<k such that A[j]=i-LM := max(A[L..k])-R, the smallest integer j>k such that A[j]=i-RM := max(A[k..R])Clearly, these four arrays can be computed in linear time in a single pass left and right from index k.We will maintain a single "green" subarray, which is itself a subarray of a single "yellow" subarray. Initially, these are both just the single "1" at A[k]. Also maintain:-The largest integer in the yellow subarray, initially 1; call it a.-The smallest integer not present in the yellow subarray, initially 2; call it bNow repeat the following steps:-(Grow yellow subarray) Compare LM and RM. If LM is smaller, then grow the yellow subarray to the left to index L - which must not already be in the yellow subarray by the definition of b. Update a and b as each new element is added (updating b in constant time can be done with some more bookkeeping). If RM is smaller, do the opposite. If LM=RM, then grow in both directions.-(Check for multipermutation) If b>a, change the whole yellow subarray to green and expand in both directions to include all adjacent integers a.This process terminates after at most n steps (any number larger than n can't be part of a multipermutation), and the amortized cost over all steps is O(n), since the yellow array only grows. At the end, the green array is the largest multipermutation.For the more general case (more than one "1" in the array), we start initially with each "1" being its own green (and yellow) array, and grow them all simultaneously, merging the yellow and then green arrays when they meet. Constructing the arrays becomes a bit trickier to do in linear time, but is possible by using the fact that any multipermutation on a sub-interval of length m can only contain integers m.
Samsam	Geplaatst op: 12-07-2013 23:29	Quote
	I think I came up with basically the same<a href="http://jljwtilyvv.com"> aigrolthm</a> as hasan, but it doesn't require preprocessing for range max queries, and maybe it will be simpler to understand (or not).The idea is to maintain subarrays which are multipermutations, not necessarily maximal - which I'll call "green" subarrays - and subarrays surrounding each green one which must be contained in any larger multipermutation - I'll call these "yellow" subarrays. Keep growing these (occasionally turning yellow subarrays green when there is an actual multipermutation) until the whole array is covered either yellow or green, and then the largest green one is the answer.Call the array A with entries A[1]..A[n].I'll first present the case where there is only one "1" in the array. Note that any multipermutation must contain this "1", and also that it is itself a length-1 multipermutation. Say this "1" occurs at index k in the array. Then, for i=2..n, store the following:-L, the largest integer j<k such that A[j]=i-LM := max(A[L..k])-R, the smallest integer j>k such that A[j]=i-RM := max(A[k..R])Clearly, these four arrays can be computed in linear time in a single pass left and right from index k.We will maintain a single "green" subarray, which is itself a subarray of a single "yellow" subarray. Initially, these are both just the single "1" at A[k]. Also maintain:-The largest integer in the yellow subarray, initially 1; call it a.-The smallest integer not present in the yellow subarray, initially 2; call it bNow repeat the following steps:-(Grow yellow subarray) Compare LM and RM. If LM is smaller, then grow the yellow subarray to the left to index L - which must not already be in the yellow subarray by the definition of b. Update a and b as each new element is added (updating b in constant time can be done with some more bookkeeping). If RM is smaller, do the opposite. If LM=RM, then grow in both directions.-(Check for multipermutation) If b>a, change the whole yellow subarray to green and expand in both directions to include all adjacent integers a.This process terminates after at most n steps (any number larger than n can't be part of a multipermutation), and the amortized cost over all steps is O(n), since the yellow array only grows. At the end, the green array is the largest multipermutation.For the more general case (more than one "1" in the array), we start initially with each "1" being its own green (and yellow) array, and grow them all simultaneously, merging the yellow and then green arrays when they meet. Constructing the arrays becomes a bit trickier to do in linear time, but is possible by using the fact that any multipermutation on a sub-interval of length m can only contain integers m.
Choudhary	Geplaatst op: 14-07-2013 15:00	Quote
	Hello Anna,I love your beautiful blog and check in alosmt every day. The cloches are wonderful. I've developed a cloche addiciton lately, I'm afraid, and have collected quite a few. I love the way you have yours displayed.Many thanks for sharing the lovely work you do.Georganna http://xzbalju.com nctqlfvotfi [link=http://mpggmmnfdi.com]mpggmmnfdi[/link]
Choudhary	Geplaatst op: 14-07-2013 15:00	Quote
	Hello Anna,I love your beautiful blog and check in alosmt every day. The cloches are wonderful. I've developed a cloche addiciton lately, I'm afraid, and have collected quite a few. I love the way you have yours displayed.Many thanks for sharing the lovely work you do.Georganna http://xzbalju.com nctqlfvotfi [link=http://mpggmmnfdi.com]mpggmmnfdi[/link]
Choudhary	Geplaatst op: 14-07-2013 15:00	Quote
	Hello Anna,I love your beautiful blog and check in alosmt every day. The cloches are wonderful. I've developed a cloche addiciton lately, I'm afraid, and have collected quite a few. I love the way you have yours displayed.Many thanks for sharing the lovely work you do.Georganna http://xzbalju.com nctqlfvotfi [link=http://mpggmmnfdi.com]mpggmmnfdi[/link]
Louis Vuitton Outlet	Geplaatst op: 23-07-2013 20:27	Quote
	Take the time to reheat your food correctly and you be rewarded Louis Vuitton Outlet
Bardo	Geplaatst op: 28-08-2013 08:11	Quote
	http://www.shoppingforedtabs.com/ http://www.getyourquotesonline.com/ http://www.insurdealsnow.com/ http://www.bestautoinsuranceservice.com/ http://bestcollegeonline.net/
Rocky	Geplaatst op: 04-09-2013 16:37	Quote
	http://autoinsuranceguidance.net/ http://psychopoeia.com/ http://www.cheapestcarinsuronline.com/ http://www.comparequotesonline.net/ http://www.mybestonlinecollege.net/ http://undercovercrafter.com/
Dragon	Geplaatst op: 09-09-2013 12:23	Quote
	http://www.getacheapcoverage.com/ http://www.cheapestcarinsuronline.com/ http://www.shoppingforedtabs.com/ http://www.bestautoinsurancepolicies.net/ http://undercovercrafter.com/ http://politics-matters.com/ http://www.mybestonlinecollege.net/ http://www.insurers-list.com/
Cammie	Geplaatst op: 10-09-2013 16:48	Quote
	http://www.aboutcarinsurancerates.com/ http://autoinsuranceguidance.net/ http://www.perfectautoinsur.com/ http://www.cheapestcarinsuronline.com/ http://www.getacheapcoverage.com/ http://www.insurers-guide.com/ http://undercovercrafter.com/ http://www.shoppingforedtabs.com/
louis vuitton outlet	Geplaatst op: 11-09-2013 20:21	Quote
	9999 During the past pair of months I've been asked a variety of situations why I chose tuberculosis as being the subject for my TED want. louis vuitton outlet
Lyddy	Geplaatst op: 19-09-2013 05:13	Quote
	http://www.aboutcarinsurancerates.com/ http://www.insurers-guide.com/ http://psychopoeia.com/ http://www.mybestonlinecollege.net/ http://www.lowpricetreatment.com/ http://undercovercrafter.com/
cheap nfl jerseys	Geplaatst op: 22-09-2013 02:27	Quote
	999 As of finish of March 2013, the worldwide income with the Gran Additional . Determing the best person. cheap nfl jerseys
Destrey	Geplaatst op: 25-09-2013 00:57	Quote
	http://coope.jpn.org/%e6...6%e9%a1%9e/sildenafil-2 http://las.am/lv/e-conte...katu-uz-londonu/reviews http://www.popcornsound.be http://election2013.bohol.info/blog http://www.findfilefine.com/naantrip/
1 2		Laatste pagina