How IBM's Watson Answers a Question, a brief intro
The victory of IBM's Watson over some of the finest Jeopardy! contestants on February 14 to 16, 2011 was a major event in the history of natural language processing. Let's have a look at how Watson works!
Watson was built as a showcase of IBM's capabilities and as a successor to Deep Blue, the chess computer that beat Garry Kasparov in 1997. It was built in about four years by a team of roughly 20 people, some of whom were internationally renowned experts in the field of NLP.
It is probably a good idea to watch the actual Jeopardy! match first, to get an idea of what the fuss is all about and to catch a glimpse of the monumental task Watson was up against. It can be found on YouTube in two parts:
While we're at it, if you want to read about the origin of Watson and the problems its team encountered when building it, you may want to read this book. It's very enjoyable (but gives very little detail about the technical side of the program):
Final Jeopardy: The Story of Watson, the Computer That Will Transform Our World - Stephen Baker, 2012
What kind of system is Watson?
Watson, or DeepQA as the development team calls it, is a Question Answering (QA) system. Given a question in natural language, it is very good at searching large amounts of unstructured text for specific entities and at providing the confidence it has in each of the answers it finds. It also integrates structured sources such as triple stores and lookup tables, and it can combine information from multiple sources to answer a single question. It handles nested subquestions and certain questions that require problem solving. At the time of the Jeopardy! match it handled only English questions.
It is a very complex system: almost literally every technique available in the literature was tried and integrated into this machine. Besides the answering engine, which contains countless algorithms and manually written (Prolog) rules, it consists of many different machine learning processes and content crawlers, and it uses several different types of database engines. It is a distributed system. And then there is the game engine, which knows the strategies of playing Jeopardy! at the highest level.
From an architectural point of view, Watson has a pipeline (or pipe-and-filter) structure that consists of several serial processes. It is also an example of a master-slave architecture, using thousands of processor cores to search documents and process results in parallel. To this end, Watson uses Apache UIMA.
Now to the technical side
Let's start with a great explanation by David Ferrucci, Watson's principal investigator.
In this AAAI article, he and his colleagues also give an overview of the project:
Ferrucci, D, et al. (2010) "Building Watson: An Overview of the DeepQA Project"
IBM Research has published a great deal on Watson's structure. That is where I got my information, most of which can be downloaded, either directly or elsewhere on the web.
I will just give a brief introduction:
In its essence, Watson executes the following "pipeline" as it answers a question:
- Question Analysis: extract information from the question
- Hypothesis Generation: use this information to search for possible answers (candidate answers, or hypotheses)
- Hypothesis and evidence scoring: determine the confidence in each hypothesis: how well does it match the question?
- Final merging and ranking: merge and rank the hypotheses based on the confidence values
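The four stages above can be pictured as a simple pipe-and-filter chain. Here is a minimal sketch in Python; all function names, data shapes, and scores are invented for illustration (the real DeepQA stages are distributed UIMA components, not Python functions):

```python
# Illustrative pipe-and-filter sketch of DeepQA's four main stages.
# Every name and value here is a stub invented for this demo.

def question_analysis(clue):
    # Extract focus, LAT, relations, etc. from the clue (stubbed).
    return {"clue": clue, "lat": "poet", "focus": "he"}

def hypothesis_generation(analysis):
    # Search all sources for candidate answers (stubbed).
    return [{"answer": "Robert W. Service"}, {"answer": "Jack London"}]

def score_hypotheses(analysis, candidates):
    # Attach a confidence score to each candidate (stubbed).
    for c in candidates:
        c["confidence"] = 0.9 if c["answer"] == "Robert W. Service" else 0.2
    return candidates

def merge_and_rank(candidates):
    # Sort candidates by confidence, best first.
    return sorted(candidates, key=lambda c: c["confidence"], reverse=True)

def answer(clue):
    analysis = question_analysis(clue)
    candidates = hypothesis_generation(analysis)
    scored = score_hypotheses(analysis, candidates)
    return merge_and_rank(scored)[0]["answer"]

print(answer('He was a bank clerk in the Yukon before he published '
             '"Songs of a Sourdough" in 1907.'))
# prints: Robert W. Service
```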
The pipeline is further complicated when the question contains subqueries, which may be performed in parallel or be nested. In that case a separate framework is started, which uses the main pipeline for each subquery.
Question analysis
In this phase Watson is presented with a Jeopardy! question, and its goal is to extract information from it. A question consists of a category and a clue. For example:
Category: POETS & POETRY
Clue: He was a bank clerk in the Yukon before he published "Songs of a Sourdough" in 1907.
Clue: Invented in the 1500s to speed up the game, this maneuver involves two pieces of the same color.
The original question is given in all upper case, so it first has to be restored to proper upper and lower case. This is important for understanding proper names.
Proper names are detected using named entity recognition.
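To see why the casing step matters, here is a toy true-caser that restores mixed case using a small lexicon of known capitalized forms. The lexicon and the whole approach are simplifications of my own, not Watson's actual component:

```python
import string

# Toy lexicon mapping lower-case forms to their proper casing; a real
# true-caser would derive this from a large cased corpus.
LEXICON = {"yukon": "Yukon", "songs": "Songs", "sourdough": "Sourdough"}

def true_case(clue):
    out = []
    for token in clue.lower().split():
        core = token.strip(string.punctuation)   # lookup key without quotes etc.
        if core in LEXICON:
            token = token.replace(core, LEXICON[core])
        out.append(token)
    out[0] = out[0][0].upper() + out[0][1:]      # sentence-initial capital
    return " ".join(out)

print(true_case('HE WAS A BANK CLERK IN THE YUKON BEFORE HE PUBLISHED '
                '"SONGS OF A SOURDOUGH" IN 1907.'))
# prints: He was a bank clerk in the Yukon before he published "Songs of a Sourdough" in 1907.
```

Without the restored capitals, a named entity recognizer has far less to go on when spotting "Yukon" or the book title.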
It is then parsed using an English Slot Grammar (ESG) parser. A slot grammar is a dependency grammar in which each word provides explicit slots (such as subject and object) to be filled by its complements.
The predicate-argument structure (PAS) builder collapses the two different dimensions of ESG structure (deep structure and surface structure) into one.
Here's an example of PAS output:
lemma(3,"Songs of a Sourdough").
Relation detection is then performed. It turns sentence-like structures into database-like relations. Here are some examples:
publish(e1, he, "Songs of a Sourdough")
in(e2, e1, 1907)
authorOf(focus, "Songs of a Sourdough")
temporalLink(publish(. . .), 1907)
altName(focus, Flavian Amphitheatre)
anniversaryOf(this explorer's arrival in India, 400, May 1898)
rdfTriple(buy, U.S., focus)
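As a toy illustration of the idea, a single hand-written pattern can already produce relations in the shape shown above. The pattern below is invented for this one clue; Watson's real relation detection combined many hand-written rules over the parse with statistical extractors:

```python
import re

# Toy rule: match '<subj> published "<title>" in <year>' and emit
# database-style relations like the ones in the article. One regex
# stands in for what is really done over the parse tree.
PATTERN = re.compile(r'(\w+) published "([^"]+)" in (\d{4})')

def detect_relations(sentence):
    m = PATTERN.search(sentence)
    if not m:
        return []
    subj, title, year = m.groups()
    return [f'publish(e1, {subj}, "{title}")',
            f'in(e2, e1, {year})']

print(detect_relations('he published "Songs of a Sourdough" in 1907'))
# prints: ['publish(e1, he, "Songs of a Sourdough")', 'in(e2, e1, 1907)']
```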
Based on this analysis, the focus and the LAT are determined. The focus is the part of the question that is a reference to the answer. For example, in "They’re the two states you could be reentering if you’re crossing Florida’s northern border." the focus is "They". It can be replaced by the answer and still yield a valid sentence. The LAT (Lexical Answer Type) constrains the types of answers. Examples are: country, president, poet. Possible answers will be checked against this LAT.
The question is classified as either a standard question or a special question. A standard question is a factoid question that can be solved by search alone. Special questions come in several flavors:
Puzzle: ASTRONOMICAL RHYME TIME: Any song about earth’s natural satellite. (Answer: "Moon tune")
Multiple Choice: BUSY AS A BEAVER: Of 1, 5, or 15, the rough maximum number of minutes a beaver can hold its breath underwater. (Answer: "15")
Common Bond: BOND, COMMON BOND: Fan, form, poison pen. (Answer: "letters")
Fill-in-the-Blank (FITB): SCHOOL OF ROCK: History: Herman’s Hermits hit number 1 in 1965 singing "I’m" this man, "I Am." (Answer: "Henry VIII")
Constraint Bearing: ARE YOU A FOOD"E"?: Escoffier says to leave them in their shells & soak them in a mixture of water, vinegar, salt, and flour. (Answer: "Escargots")
Pun Bearing: HIP-HOP ON POP: From this "M.D.": "It’s like this & like that & like this & uh, its like that & like this & like that & uh" (Answer: "Dr. Dre")
The question is searched for answer hints. For example, in the question: (category: SLOGANEERING) "In 1987 using the slogan "Catch the wave", Max Headroom pitched this "new" soft drink." The answer will likely contain the word "new". (Answer: "New Coke")
Some questions are built from two or more subquestions. These are handled by a separate framework that I will describe later.
- Lally, A. et al (2012) "Question analysis: How Watson reads a clue"
- McCord, M.C. et al (2012) "Deep parsing in Watson"
- Wang, C. et al (2012) "Relation extraction and scoring in DeepQA"
Hypothesis generation
In this phase Watson uses the information extracted during question analysis to find candidate answers (hypotheses).
The DeepQA team emphasizes that Watson does not Type-and-Generate: it does not search only for answers of the expected answer type (LAT), because that approach was found to be too brittle. Instead, all data sources are searched for combinations of named entities, and in the following phase the results are checked for the correct type. This is called Generate-and-Type.
Watson makes heavy use of Wikipedia, for three reasons. First, the answers to about 95% of Jeopardy! questions occur as Wikipedia titles. Second, Wikipedia allows fast document searches with the named entities as input, and the title of a matching article can be used as a candidate answer. Third, some of the main characteristics of a Wikipedia article are stored in a structured data source, DBpedia, which allows Watson to check candidates for correctness.
Wikipedia is not the only source of information. Watson also uses other encyclopedias, dictionaries, newswire articles, triple stores, and lookup tables. It also creates its own title-based pseudo-documents by intelligently processing the results of web search queries.
Searches are not done online, by the way: all texts are stored locally and indexed by a search engine such as Lucene.
An exception to Generate-and-Type is made for questions with closed LATs: the answer belongs to a small set of possible values that is available as a list, for example countries or US presidents. In this case all values in the set are taken as candidate answers.
Some answers are calculated. Special algorithms exist for a small set of mathematical questions. For example: "It’s the square root of the square root of 16." (answer: "2")
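For that particular clue shape, a calculator is easy to sketch. The solver below handles only nested "square root of" clues and is my own toy, not one of Watson's actual algorithms:

```python
import math
import re

# Toy solver for one mathematical clue shape: any number of nested
# "square root of" applications to a single number. Watson's real
# special-purpose math algorithms covered more shapes than this.
def solve_sqrt_clue(clue):
    depth = len(re.findall(r"square root of", clue))
    number = float(re.search(r"\d+", clue).group())
    for _ in range(depth):
        number = math.sqrt(number)
    return number

print(solve_sqrt_clue("It's the square root of the square root of 16."))
# prints: 2.0
```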
Somewhere in Watson's pipeline, profanities are filtered out. Probably at this location. IBM would be very displeased by a cursing and foul-mouthed Watson. In a test round it once used the F-word.
- Chu-Carroll, J. et al (2012) "Finding needles in the haystack: Search and candidate generation"
Hypothesis and evidence scoring
In this phase the candidate answers are scored by confidence.
This phase has two main parts. In the first, called soft filtering, Watson checks whether the candidate answers match the LAT. This is called Type Coercion (TyCor). For example: when the LAT is "US town", does "Toronto" match? (No.) Candidate answers that do not match are not dropped completely; their confidence score is just lowered. That is why it is called soft filtering. Watson needs this because natural language allows analogies, puns, etc. that make it impossible to exclude any answer completely.
There were eleven TyCor components in 2012: YAGO, Intro, Gender, ClosedLAT, Lexical, Identity, NED, WordNet, WikiCat, List, and Prismatic. Each component interprets both the candidate answer and the LAT in the local jargon of its knowledge base, and then checks whether the type of one is a subtype of the other.
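The subsumption check can be pictured with a toy taxonomy. The parent table below is invented for this sketch; the real components consult knowledge bases such as YAGO and WordNet:

```python
# Toy type-coercion (TyCor) check against a hand-made taxonomy.
# Each entry maps a type to its parent; a candidate coerces to the
# LAT if the LAT appears on the candidate's ancestor chain.
PARENT = {
    "Toronto": "Canadian city",
    "Canadian city": "city",
    "Chicago": "US town",
    "US town": "city",
}

def tycor(candidate, lat):
    t = candidate
    while t is not None:
        if t == lat:
            return True
        t = PARENT.get(t)
    return False

print(tycor("Toronto", "US town"))   # False: confidence is lowered, not dropped
print(tycor("Chicago", "US town"))   # True
```

Note that a False result only lowers the candidate's confidence in the soft-filtering scheme described above; it never removes the candidate outright.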
In the second part, the candidate answer is checked against all constraints that the question (and its category) imposes on it. For a question that starts with "This country that shares its boundary with Argentina. . ." the Hypothesis Generation phase may return all countries; in the evidence phase that follows, every country that does not match the relation `bordersOn(focus, X)` receives a lower confidence.
In 2012 Watson had about a hundred scorers, each of which might or might not apply to a given question. Each passes a confidence that the candidate answer is the actual answer. Among these were: the degree of match between a passage’s predicate-argument structure and the question, passage source reliability, geospatial location, temporal relationships, taxonomic classification, the lexical and semantic relations the candidate is known to participate in, the candidate’s correlation with question terms, its popularity (or obscurity), its aliases, and so on.
- Murdock, J. et al (2012) "Typing candidate answers using type coercion"
- Murdock, J. et al (2012) "Textual evidence gathering and analysis"
- Wang, C. et al (2012) "Relation extraction and scoring in DeepQA"
- Welty, C. et al (2012) "A Comparison of Hard Filters and Soft Evidence for Answer Typing in Watson"
Final merging and ranking
In this phase all candidate answer confidence scores are processed and a ranked list of answers is produced.
The confidence scores from the previous phase are not normalized. A candidate may have been given several high scores in the evidence phase, but those scores may not be very relevant in the current question category. To determine which scores are important under which circumstances, this phase uses a machine learning framework that is continually retrained.
The technique used is regularized logistic regression: it learns a weight for each evidence score and maps the weighted sum of a candidate's scores to a single probability, while the regularization term keeps the weights from overfitting the training questions. It was the best of the many techniques the team tried.
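A minimal sketch of the scoring side of logistic regression, with invented weights and feature values (training, and hence the regularization itself, is omitted):

```python
import math

# Minimal logistic-regression scorer: each candidate has a feature
# vector of evidence scores; learned weights (invented here) turn it
# into a single confidence. Regularization only constrains the
# weights during training, so it does not appear in this step.
WEIGHTS = [2.0, 1.5, -0.5]   # e.g. passage match, source reliability, obscurity
BIAS = -1.0

def confidence(features):
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid squashes z into (0, 1)

candidates = {"Robert W. Service": [0.9, 0.8, 0.1],
              "Jack London": [0.3, 0.8, 0.0]}
ranked = sorted(candidates, key=lambda a: confidence(candidates[a]),
                reverse=True)
print(ranked[0])
# prints: Robert W. Service
```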
Similar answers are merged in this phase. "John Fitzgerald Kennedy" and "JFK" may both have resulted from the previous phase; this phase tries to recognize such different names for the same entity, and if it succeeds, their evidence may be combined.
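Merging can be pictured as normalizing aliases to a canonical name and pooling the scores. The alias table and the simple score-summing are invented for this sketch; Watson's actual merging is more sophisticated:

```python
# Toy answer merging: map aliases to a canonical name (the table is
# invented) and combine the evidence of merged candidates by summing
# their scores.
ALIASES = {"JFK": "John Fitzgerald Kennedy",
           "John F. Kennedy": "John Fitzgerald Kennedy"}

def merge(candidates):
    merged = {}
    for name, score in candidates:
        canon = ALIASES.get(name, name)
        merged[canon] = merged.get(canon, 0.0) + score
    return merged

print(merge([("JFK", 0.25), ("John Fitzgerald Kennedy", 0.5)]))
# prints: {'John Fitzgerald Kennedy': 0.75}
```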
Watson also detects and stores "more specific" relations between candidate answers. In the question (category: MYTHING IN ACTION): "One legend says this was given by the Lady of the Lake & thrown back in the lake on King Arthur’s death." the first answer is "sword". When prompted to be more specific, Watson is able to come up with "Excalibur".
Hitlist normalization: drop candidates that have very low scores.
Evidence diffusion: if a question yields two answers (e.g. Pyongyang and North Korea) and the LAT asks for a country, but the evidence supports the city (Pyongyang), that evidence may be transferred to the country (North Korea) if there is a semantic relation between the two (here: `locatedIn`) that justifies the transfer.
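A toy version of that diffusion step, with an invented relation table and an invented transfer ratio of one half:

```python
# Toy evidence diffusion: when a semantic relation (here locatedIn)
# links two candidates and the LAT favors the target, transfer part
# of the source's evidence. The relation table and the one-half
# transfer ratio are invented for this sketch.
LOCATED_IN = {"Pyongyang": "North Korea"}

def diffuse(scores, matches_lat):
    for city, country in LOCATED_IN.items():
        if city in scores and country in scores and matches_lat(country):
            scores[country] += 0.5 * scores[city]
    return scores

scores = {"Pyongyang": 0.8, "North Korea": 0.2}
print(diffuse(scores, lambda a: a == "North Korea"))
```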
- Gondek, D.K. et al (2012) "A framework for merging and ranking of answers in DeepQA"
Some questions can be split up into subquestions. These are either parallel or nested. Watson has a special framework for subquestions.
First, subquestions are recognized. Then they are rewritten into independent questions and offered anew to the Question Answering component.
For example: the question (category: HISTORIC PEOPLE) "The life story of this man who died in 1801 was chronicled in an A&E Biography DVD titled "Triumph and Treason"." can be split into these subquestions:
Q1: This man who died in 1801.
Q2: The life story of this man was chronicled
These are then rewritten to these independent questions:
Q1: (category: HISTORIC PEOPLE; A&E Biography DVD "Triumph and Treason") "This man who died in 1801."
Q2: (category: HISTORIC PEOPLE; 1801) "The life story of this man was chronicled in an A&E Biography DVD titled "Triumph and Treason"."
Note that the remaining information from the original sentence is added to each subquestion as well, between brackets. This contextual hint helps to retrieve the most relevant information first.
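The rewriting step can be sketched for this one clue shape. The split rule below (cutting around "who died in <year>" and bracketing the leftover context) is hand-made for this example; Watson's decomposition is far more general:

```python
# Toy question decomposition: split the clue around the inner clause
# "who died in <year>" and attach the leftover context in brackets
# as a retrieval hint. The split rule is invented for this one clue
# shape.
def decompose(clue):
    head, _, tail = clue.partition(" who died in ")
    year, _, rest = tail.partition(" ")
    sub1 = f"({rest}) This man who died in {year}"
    sub2 = f"({year}) {head} {rest}"
    return [sub1, sub2]

clue = ('The life story of this man who died in 1801 was chronicled '
        'in an A&E Biography DVD titled "Triumph and Treason".')
for sub in decompose(clue):
    print(sub)
```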
- Kalyanpur, A. et al (2012) "Fact-based question decomposition in DeepQA"
The research team also provides some articles about the handling of specific types of questions.
- Chu-Carroll, J. et al. (2012) "Identifying implicit relationships"
- Prager, J.M. et al. (2012) "Special Questions and techniques"
Watson uses both structured and unstructured sources. Most of these were preexisting, but one source was created explicitly for Watson: PRISMATIC. It collects meaningful relations and statistics extracted from a large number of documents. All knowledge sources are used both in hypothesis generation and in candidate scoring.
- Chu-Carroll, J. et al (2012) "Textual resource acquisition and engineering"
- Fan, J. et al (2012) "Automatic knowledge extraction from documents"
- Kalyanpur, A. et al (2012) "Structured data and inference in DeepQA"
Studying Watson is fun and interesting, because it is about a real challenge: building a machine that could win at Jeopardy!, a game that requires an extraordinary amount of knowledge, language skills, and speed. Watson combines many techniques, both rule-based and statistical, that could be used in other projects as well. Many members of the team have done their best to explain what they have done. The articles are lightened up with examples from the game of Jeopardy! They are written in a lively manner, with many interesting details and illuminating diagrams. Watson truly is a landmark piece of natural language processing and of artificial intelligence in general.