I would say the core component of IBM Watson is Natural Language Processing (NLP); without it, Watson would not even be able to understand the question, let alone answer it. NLP lets Watson derive the meaning of the question: how many parts the question has, and how to interpret the meaning of and relationships between those parts. This semantic analysis of the question obviously has a direct impact on the correctness of the answer.
Natural Language Processing, or NLP as it is called, is a huge area of artificial intelligence concerned with human-computer interaction (HCI). I do not think of myself as knowledgeable on this subject, but lately I have tried to understand its depth with respect to my interest in Watson. This blog post simply documents that study to some extent.
ELIZA, written at MIT by Joseph Weizenbaum around 1964-66 and named after the protagonist in George Bernard Shaw's Pygmalion, is one of the earliest examples of NLP. ELIZA is, in some sense, the great-grandmother of Apple's Siri. :)
The first thing to know when beginning with NLP is the linguistic concepts. Grammar is at the core of any language: the grammar of a language is how its sentences are constructed. It is interesting to note from the history of grammar that the first systematic grammars originated in Iron Age India, with Yaska (6th century BC), Pāṇini (4th century BC) and his commentators Pingala (c. 200 BC), Katyayana, and Patanjali (2nd century BC). Tolkāppiyam, the earliest Tamil grammar, is mostly dated to before the 5th century AD.
In NLP, multiple text-processing components are used in a sort of pipeline of tasks, performed in sequence to extract value from text. These text-processing components are tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and co-reference resolution.
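To make the pipeline idea concrete, here is a minimal sketch using the open-source NLTK toolkit. NLTK is simply my choice for illustration; Watson's internal components are, of course, its own.

```python
import nltk
# One-time model downloads (uncomment on first run):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')

def pipeline(text):
    """Run text through a few of the classic text-processing stages."""
    for sentence in nltk.sent_tokenize(text):   # sentence segmentation
        tokens = nltk.word_tokenize(sentence)   # tokenization
        tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
        tree = nltk.ne_chunk(tagged)            # named entity extraction
        yield sentence, tagged, tree

for sentence, tagged, tree in pipeline("Who wrote Pygmalion? George Bernard Shaw did."):
    print(sentence)
    print(tagged)
```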
Tokenization is breaking a stream of text into smaller meaningful parts called tokens. Tokens are usually words, phrases or sentences demarcated by punctuation. In English, words are mostly separated from each other by blanks, or white space as it is called in the IT world. The thing to remember, though, is that not all white space is equal: "San Francisco" or "fast food" is supposed to be taken as a single token, while on the other hand "I'm" should be two words, "I am". Further challenges to consider are abbreviations, acronyms, hyphenated words and, most importantly, numerical and special expressions such as telephone numbers, date formats and vehicle license numbers.
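Here is a small sketch of these tokenization quirks using NLTK's word_tokenize and its MWETokenizer for multi-word expressions; the multi-word list below is just something I made up for illustration, not a standard resource.

```python
import nltk
from nltk.tokenize import MWETokenizer
# nltk.download('punkt')  # one-time download of the tokenizer models

text = "I'm flying to San Francisco for some fast food."

tokens = nltk.word_tokenize(text)
print(tokens)  # note that "I'm" is split into "I" and "'m"

# Multi-word expressions: re-join tokens that should behave as one unit.
mwe = MWETokenizer([('San', 'Francisco'), ('fast', 'food')])
print(mwe.tokenize(tokens))  # ... 'San_Francisco' ... 'fast_food' ...
```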
Sentence segmentation is, simply put, dividing the stream of text into sentences. Let us not jump to the conclusion that it is basically looking for a period or other end-of-sentence punctuation. Remember, the period is used in many more contexts than just the end of a sentence: it may denote an abbreviation, a decimal point, or part of an email address. Question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. Thus, we need specialized algorithms to really identify the end of a sentence. In fact, there is a special name for such algorithms: sentence boundary detection. I read a very interesting paper on this by Dan Gillick of UC Berkeley, along with his splitta tool: https://code.google.com/p/splitta/ He claims it includes proper tokenization and models for very high accuracy sentence boundary detection (English only for now). The models are trained on Wall Street Journal news combined with the Brown Corpus, which is intended to be widely representative of written English; error rates on test news data are near 0.25%.
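I have not tried splitta's API myself, so as a stand-in here is a sketch using NLTK's pre-trained Punkt sentence boundary detector, which tackles the same problem.

```python
import nltk
# nltk.download('punkt')  # one-time download of the pre-trained Punkt model

text = ("Dr. Smith paid $4.50 for coffee this morning. "
        "Was it worth it? He said it was!")

# Punkt is an unsupervised sentence boundary detector: it learns which
# periods belong to abbreviations or numbers rather than sentence ends.
for sentence in nltk.sent_tokenize(text):
    print(sentence)
```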
Another very interesting component of text processing is part-of-speech tagging. The basic building blocks of a language are its words, and linguists classify words into various classes or "parts of speech" (POS). In English, they are noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. The process of assigning one of the parts of speech to a given word, or rather token, is called part-of-speech tagging, and a POS tagger is an algorithm or program that does this tagging. A POS tagger marks tokens with their corresponding word type based on the token itself and on the context of the token, and the context is the very important part. Wikipedia has a very interesting example in the sentence "The sailor dogs the hatch.": "dogs" is usually thought of as just a plural noun, but in this context it is a verb. There is a variety of algorithms used to do POS tagging, such as the Viterbi algorithm, the Brill tagger, Constraint Grammar, and the Baum-Welch algorithm. Methods like hidden Markov models or visible Markov models have been used, and many machine learning methods have also been applied to the problem of POS tagging, e.g. SVMs, the maximum entropy classifier, the maximum-entropy Markov model (MEMM, also called the conditional Markov model, CMM), the perceptron, and nearest-neighbor methods.
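A quick sketch of POS tagging with NLTK's pre-trained tagger, using the same Wikipedia example; note that off-the-shelf taggers do not always get the tricky "dogs" case right.

```python
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "The sailor dogs the hatch."
tokens = nltk.word_tokenize(sentence)

# pos_tag uses a pre-trained perceptron tagger and the Penn Treebank
# tag set (DT = determiner, NNS = plural noun, VBZ = 3rd-person verb, ...).
print(nltk.pos_tag(tokens))
# Context matters: "dogs" should ideally come out as a verb here,
# though simpler taggers often mislabel it as a plural noun.
```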
Further classification, by labeling sequences of words in text as names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc., is named entity extraction. There are many approaches, both linguistic grammar-based techniques and statistical models, for named entity extraction, or named entity recognition (NER) as it is sometimes called. Conditional random fields (CRFs) are a class of statistical modeling methods often applied to NER.
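A small NER sketch using NLTK's ne_chunk; the example sentence is my own, and the entity labels come from NLTK's pre-trained model, not from Watson.

```python
import nltk
# One-time downloads: punkt, averaged_perceptron_tagger, maxent_ne_chunker, words

sentence = "IBM built Watson at its research lab in Yorktown Heights, New York."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# ne_chunk wraps recognized entities in subtrees labeled PERSON,
# ORGANIZATION, GPE (geo-political entity), and so on.
tree = nltk.ne_chunk(tagged)
for subtree in tree:
    if hasattr(subtree, 'label'):  # only entity subtrees have a label
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)
```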
Parsing is the task of analyzing the grammatical structure of natural language. A parser finds which groups of words go together (as "phrases") and which words are the subject or object of a verb, and it generates a parse tree for a given sentence. Parse trees are something like the "sentence diagrams" we learnt when studying English grammar in school. Parse trees can be constituency-based or dependency-based. A constituent is a word or a group of words that functions as a single unit within a hierarchical structure in a phrase structure grammar; the dependency relation, on the other hand, views the (finite) verb as the structural center of all clause structure, with all other syntactic units (e.g. words) either directly or indirectly dependent on the verb.
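To illustrate a constituency-based parse, here is a toy sketch with a hand-written context-free grammar in NLTK; real parsers, of course, learn far richer grammars from treebanks rather than from a few rules like these.

```python
import nltk

# A toy context-free grammar, just enough to parse one example sentence.
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> Det N
  VP  -> V NP
  Det -> 'the'
  N   -> 'sailor' | 'hatch'
  V   -> 'dogs'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'sailor', 'dogs', 'the', 'hatch']):
    tree.pretty_print()  # draws the constituency tree as ASCII art
```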
Chunking is also called shallow parsing, and it is basically the identification of parts of speech and short phrases, e.g. noun phrases or verb phrases. Full parsing is expensive; partial or shallow parsing can be much faster, may be sufficient for many applications, and can also serve as a possible first step towards full parsing.
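A minimal shallow-parsing sketch with NLTK's RegexpParser, grouping noun phrases with a single hand-written rule.

```python
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumped over the lazy dog."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# One chunking rule: an optional determiner, any adjectives, then a noun
# form an NP (noun phrase) chunk.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
print(chunker.parse(tagged))
```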
Last but not least, we discuss co-reference resolution. Co-reference resolution is the task of finding all expressions that refer to the same entity in a discourse. E.g. in "Sheila said she fell down", working out that "she" refers to Sheila is co-reference resolution. There are many types of co-reference, like anaphora, cataphora, split antecedents, co-referring noun phrases, etc. Anaphora is when the proform follows the expression to which it refers, e.g. in "Sheila said she fell down", "she" follows Sheila, to whom it refers. Cataphora is when the proform precedes the expression to which it refers, e.g. in "She fell down, Sheila said", "she" precedes Sheila, to whom it refers. Algorithms intended to resolve co-references commonly look first for the nearest preceding individual that is compatible with the referring expression; some are deterministic or multi-pass sieve algorithms.
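To illustrate the "nearest compatible preceding mention" idea, here is a toy sketch in plain Python. The pronoun and name lexicons are made up for the example, and real resolvers (such as multi-pass sieves) use far richer features than this.

```python
# Toy gender lexicons, purely for illustration.
PRONOUNS = {"she": "female", "he": "male"}
NAMES = {"Sheila": "female", "John": "male"}

def resolve(tokens):
    mentions = []   # (index, name) pairs seen so far
    links = {}      # pronoun index -> antecedent index
    for i, tok in enumerate(tokens):
        if tok in NAMES:
            mentions.append((i, tok))
        elif tok.lower() in PRONOUNS:
            gender = PRONOUNS[tok.lower()]
            # scan backwards for the nearest mention compatible in gender
            for j, name in reversed(mentions):
                if NAMES[name] == gender:
                    links[i] = j
                    break
    return links

tokens = "Sheila said she fell down".split()
print(resolve(tokens))  # {2: 0} -> "she" refers back to "Sheila"
```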
So these were some of the very basic components of natural language processing, which forms the core of IBM Watson.