The problem with proper nouns
Before a sentence is passed to the parser, it is first run through the lexer. The lexer divides the sentence into words or word groups. These are called lexical items and they are the smallest parts of the sentence that have a single part-of-speech. Most lexical items can be looked up in the lexicon. But what about proper nouns?
Proper nouns (or, more precisely, proper names) are lexical items like "Sandra", "Peter R. de Vries" and "Ramachandra". They are not part of the lexicon (dictionary), so they will not be recognized by a lexicon lookup alone.
Still, you may be able to recognize some of them by looking them up in a knowledge base.
When was Lord Byron born?
The name "Lord Byron" may be looked up and form a single lexical item with the part-of-speech "proper noun".
[when, was, Lord Byron, born]
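To make this concrete, here is a minimal sketch of such a lookup. The KNOWN_NAMES set is a hypothetical stand-in for a real knowledge base, and the function name is my own. It tries the longest known name first, so "Lord Byron" becomes one lexical item.

```python
# Hypothetical stand-in for a knowledge base of known names,
# stored as tuples of words.
KNOWN_NAMES = {("Lord", "Byron")}

def lex_with_knowledge_base(sentence, known_names=KNOWN_NAMES):
    """Split a sentence into lexical items, merging known names."""
    words = sentence.rstrip(".?!").split()
    longest = max((len(name) for name in known_names), default=0)
    items = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + size])
            if candidate in known_names:
                items.append(" ".join(candidate))
                i += size
                break
        else:
            # No known name starts here: treat it as an ordinary word.
            items.append(words[i].lower())
            i += 1
    return items

print(lex_with_knowledge_base("When was Lord Byron born?"))
# ['when', 'was', 'Lord Byron', 'born']
```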
But this sentence is different:
My name is Jan de Vries.
The name of the conversation partner cannot be looked up. That is only logical, because the purpose of the sentence is introduction, and I mean that in two ways: the person introduces himself, and at the same time a new lexical item, "Jan de Vries", is introduced into the mental lexicon of the agent.
How can a lexer recognize a new proper noun when it encounters one? I can think of two ways:
1. The words in the sentence that are not part of the lexicon form the new proper noun.
2. Proper nouns may be recognized because they follow certain patterns.
The first method is simple and applies to all names. However, it runs into several problems.
My name is Jan de Vries.
[my, name, is, Jan, de, Vries]
Since "de" is a word in the lexicon, should "Jan" and "Vries" be split into two separate proper nouns? No, because we want to end up with a single lexical item. So we may want to include the known words that lie between the unknown words as well. But which words should we include, and which should we leave out?
My name is Jan and I like Sandra.
[my, name, is, Jan and I like Sandra]
Why can "Jan and I like Sandra" not be a proper noun? Because it is too long? What about
My name is Johannes Hubertus van der Laak.
It contains just as many words.
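The first method, extended to take in the known words between the unknown ones, could be sketched as follows. The LEXICON set here is a tiny hypothetical stand-in for a real lexicon, and the function name is my own. The sketch shows why the method cannot tell a long name from a run of ordinary words.

```python
# Hypothetical miniature lexicon; a real one would be far larger.
LEXICON = {"my", "name", "is", "de", "and", "i", "like", "van", "der"}

def lex_unknown_span(sentence, lexicon=LEXICON):
    """Merge everything from the first to the last unknown word
    into one proper noun."""
    words = sentence.rstrip(".?!").split()
    unknown = [i for i, w in enumerate(words) if w.lower() not in lexicon]
    if not unknown:
        return [w.lower() for w in words]
    first, last = unknown[0], unknown[-1]
    return (
        [w.lower() for w in words[:first]]
        + [" ".join(words[first:last + 1])]
        + [w.lower() for w in words[last + 1:]]
    )

print(lex_unknown_span("My name is Jan de Vries."))
# ['my', 'name', 'is', 'Jan de Vries']
print(lex_unknown_span("My name is Jan and I like Sandra."))
# ['my', 'name', 'is', 'Jan and I like Sandra']
```

Both results have the same shape, yet only the first is a correct lexing. Nothing in the word list itself says where the name ends.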
That brings us to the second option: patterns.
The pattern for "Jan de Vries" is
"Ul+ de Ul+" (Ul+ = upper case letter followed by one or more lower case letters)
and the pattern for "Johannes Hubertus van der Laak" is
"Ul+ Ul+ van der Ul+"
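These two patterns translate quite directly into regular expressions. The following is only a sketch: the names CAP, NAME_PATTERNS and find_name are my own, and "Ul+" is restricted to the ASCII letters A-Z and a-z for simplicity, so a name like "Zoë" would not match.

```python
import re

# "Ul+": one upper case letter followed by one or more lower case letters.
CAP = r"[A-Z][a-z]+"

# Longer pattern first, so it is preferred when both could apply.
NAME_PATTERNS = [
    re.compile(rf"\b{CAP} {CAP} van der {CAP}\b"),  # Ul+ Ul+ van der Ul+
    re.compile(rf"\b{CAP} de {CAP}\b"),             # Ul+ de Ul+
]

def find_name(sentence):
    """Return the first substring that matches a name pattern, or None."""
    for pattern in NAME_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group(0)
    return None

print(find_name("My name is Jan de Vries."))
# Jan de Vries
print(find_name("My name is Johannes Hubertus van der Laak."))
# Johannes Hubertus van der Laak
print(find_name("My name is Jan and I like Sandra."))
# None
```

Unlike the first method, the patterns do not swallow "Jan and I like Sandra", because "and I like" fits neither pattern.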
Why require the capitals? People often leave them out, don't they? Yes, they do. However, if the restriction on capital letters is dropped, the following sentence
john mills the grain.
would be lexed as
[john mills, the, grain]
and this cannot be parsed.
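The effect is easy to reproduce. The sketch below assumes a simple two-word pattern "Ul+ Ul+" (first name, last name), which is my own illustration and not one of the patterns given earlier. With the capital-letter restriction in place the sentence is left alone; once case is ignored, the first two words are merged into a name.

```python
import re

# Hypothetical two-word name pattern "Ul+ Ul+", ASCII letters only.
TWO_WORD_NAME = r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"

def lex_with_pattern(sentence, ignore_case=False):
    """Merge the first pattern match into one lexical item."""
    flags = re.IGNORECASE if ignore_case else 0
    text = sentence.rstrip(".?!")
    match = re.search(TWO_WORD_NAME, text, flags)
    if not match:
        return text.split()
    return (
        text[:match.start()].split()
        + [match.group(0)]
        + text[match.end():].split()
    )

print(lex_with_pattern("john mills the grain."))
# ['john', 'mills', 'the', 'grain']
print(lex_with_pattern("john mills the grain.", ignore_case=True))
# ['john mills', 'the', 'grain']
```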
Remember, the lexer has no knowledge of the syntactic structure of the sentence. If the lexer were integrated into the parser, it could be more flexible and offer several possible interpretations. However, this would make the parsing process more complicated and computationally expensive.