How long is a chinaman
I wrote before about the problem with proper nouns: how to recognize the name of a person in a sentence when that name is not part of the lexicon. At that point I thought the problem needed to be solved in the lexer, because when the parser starts, the possible parts of speech of all words need to be known. I also thought that a person's name, like "Johnny Mnemonic", needed to be presented to the parser as a single lexeme of the category "proper noun".
I ran into two problems with this approach. The first is that you need to build a small parser inside your lexer in order to recognize proper nouns. This requires you to put a lot of domain knowledge about the structure of proper nouns into code, and that's bad. An NLP engine should always strive for maximum declarative representation of its domain knowledge, so that its users can change it without programming.
The other problem is that it is slow. For every sequence of consecutive words in a sentence, you need to determine whether it is a proper noun. I added an artificial limit of 5 words, but names may be even longer.
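To get a feel for the cost, here is a minimal Python sketch of the lexer-side search described above. The function name candidate_spans and the example sentence are illustrative assumptions, not the engine's actual code; only the limit of 5 words comes from the text.

```python
def candidate_spans(words, max_len=5):
    """Yield every run of consecutive words that is at most max_len words long."""
    for start in range(len(words)):
        # The span may not run past the end of the sentence.
        last_end = min(start + max_len, len(words))
        for end in range(start + 1, last_end + 1):
            yield words[start:end]


sentence = "How many children had Lord Byron".split()
spans = list(candidate_spans(sentence))

# Each of these spans would have to be checked as a possible proper noun.
print(len(spans))
```

Every one of these spans needs its own "is this a name?" check, and the number of spans grows with both the sentence length and the chosen limit.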
I wanted to solve the problem by creating rewrite rules for proper nouns that could be processed by the parser. For this purpose I needed to introduce a phrase that combined the distinct proper nouns of a name. And I needed to find a part-of-speech for insertions like "de", "van", and "der".
This proved quite simple. The phrase I came up with is "Proper Name (PN)". A proper noun is a single word. A proper name can be a single word, but it can also consist of multiple words. See Wikipedia.
Here's an example of some rewrite rules for proper names:
PN := propernoun insertion propernoun (example: Jan de Wit)
PN := propernoun propernoun propernoun (example: Anne Isabella Milbanke)
PN := propernoun propernoun (example: Johnny Mnemonic)
PN := propernoun (example: Jan)
NP := PN
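The rules above can be kept as plain data and matched against a sequence of words. The following Python sketch shows one way to do that. The names PN_RULES, INSERTIONS, categories and match_pn are hypothetical, the stand-in lexer simply treats every non-insertion word as a possible proper noun, and the matcher takes the first rule that fits, whereas a real parser would explore all alternatives.

```python
# The PN rewrite rules from the text, stored as data (longest rules first).
PN_RULES = [
    ("propernoun", "insertion", "propernoun"),   # Jan de Wit
    ("propernoun", "propernoun", "propernoun"),  # Anne Isabella Milbanke
    ("propernoun", "propernoun"),                # Johnny Mnemonic
    ("propernoun",),                             # Jan
]

# Insertions mentioned in the text.
INSERTIONS = {"de", "van", "der"}


def categories(word):
    """Hypothetical lexer stand-in: return the possible parts of speech of a word."""
    if word in INSERTIONS:
        return {"insertion"}
    return {"propernoun"}  # assumption: any other word may be a proper noun


def match_pn(words, start=0):
    """Return the words covered by the first PN rule that matches at start, or None."""
    for rule in PN_RULES:
        segment = words[start:start + len(rule)]
        if len(segment) == len(rule) and all(
            category in categories(word) for category, word in zip(rule, segment)
        ):
            return segment
    return None


print(match_pn(["Jan", "de", "Wit"]))
print(match_pn(["Johnny", "Mnemonic"]))
print(match_pn(["de", "Wit"]))
```

Because the rules are data, adding a new name pattern means adding a tuple to the list, not writing more matching code.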
This may not solve all problems with names, but for the names I am currently using, it suffices.
Because this solution looks for proper nouns only where a noun phrase is expected, it skips many positions in the sentence and is faster.
Which words can be proper nouns?
This leaves us with the question of which words to consider proper nouns. Let's start with a simple rule:
Any word can be a proper noun.
This fails badly. Consider the following sentence:
How many children saw the movie?
"How many children" can be parsed as a proper name (propernoun propernoun propernoun). But that's silly, isn't it? Why? Because these are just normal words that occur in the lexicon.
So what about this rule:
Any word that is not in the lexicon can be a proper noun.
This is a lot better, but in my integration tests English and Dutch sentences are mixed, and the agent will happily start parsing a Dutch sentence as if it were English.
Hoeveel kinderen had Lord Byron?
Now "Hoeveel kinderen" is parsed as a proper name (propernoun propernoun). Maybe your engine only handles texts in a single language and this problem doesn't exist. I solved it by adding the constraint that a proper name must start with a capital.
OK, but what about the perfectly normal sentence:
How Long is a chinaman.
or the statement about the band with the funny name "the The"?
The The is brilliant.
For these names to be recognized, we can add them explicitly to the lexicon, but we must require that they be written with a capital.
The last rule is then:
Words can be proper nouns if they start with a capital letter.
The problem with this rule is that people often leave out the capital letters of proper nouns. For my application this is not a problem, because I require input sentences to be grammatically correct.
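To compare the three candidate rules from this section side by side, here is a small Python sketch. The tiny LEXICON, the function names and the naive whitespace tokenizer are all assumptions made for illustration; the example sentences are the ones used above.

```python
# Hypothetical mini-lexicon of ordinary (lowercase) English words.
LEXICON = {"how", "many", "children", "saw", "the", "movie", "is", "brilliant"}


def rule_any(word):
    """Rule 1: any word can be a proper noun."""
    return True


def rule_unknown(word):
    """Rule 2: any word that is not in the lexicon can be a proper noun."""
    return word.lower() not in LEXICON


def rule_capital(word):
    """Final rule: a word can be a proper noun if it starts with a capital letter."""
    return word[:1].isupper()


def candidates(sentence, rule):
    """Return the words of the sentence that the rule accepts as proper nouns."""
    words = sentence.rstrip("?.!").split()  # naive tokenizer, for illustration only
    return [word for word in words if rule(word)]


for rule in (rule_any, rule_unknown, rule_capital):
    print(rule.__name__, candidates("The The is brilliant.", rule))
```

In this sketch the second rule finds no candidates in "The The is brilliant.", because every word is in the lexicon, while the capital-letter rule accepts both occurrences of "The". Note that the capital-letter rule also accepts the first word of any sentence; it is then up to the parser to reject readings that do not lead to a complete parse.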
But what about this chinaman?
For those of you interested in the original question, Answers.com has the answer:
My father's response was always, "So Long is his sister."
and some guy's reply:
Hate to disagree with your father but since in chinese the family name comes first, How Soon is the name of his close relative.
If you want to distinguish between the given name (Long) and the family name (How), you could use features. A name like "How" is then listed in the lexicon as a proper noun with the feature "familyname = 'How'". The PN will then inherit the features of its children: "givenname = 'Long', familyname = 'How'".
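The feature inheritance can be sketched in a few lines of Python. The dictionary layout and the names LEXICON and build_pn are assumptions for illustration, not the engine's actual representation; the feature names familyname and givenname come from the text.

```python
# Hypothetical lexicon entries: each proper noun carries its own features.
LEXICON = {
    "How": {"category": "propernoun", "features": {"familyname": "How"}},
    "Long": {"category": "propernoun", "features": {"givenname": "Long"}},
}


def build_pn(words):
    """Build a PN node whose features are the union of its children's features."""
    features = {}
    for word in words:
        features.update(LEXICON[word]["features"])
    return {"category": "PN", "children": list(words), "features": features}


pn = build_pn(["How", "Long"])
print(pn["features"])
```

The resulting PN node carries both familyname and givenname, so later processing can tell the two parts of the name apart without looking at word order.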