The results are often impressive. DARPA runs annual competitions among automated language-translation systems for different language pairs, and Google Translate often wins for certain pairs, outperforming systems built directly by human linguists.
Over the past decade two major insights have deeply influenced the natural-language-understanding field. The first has to do with hierarchies. Although the Google approach started with association of flat word sequences from one language to another, the inherent hierarchical nature of language has inevitably crept into its operation. Systems that methodically incorporate hierarchical learning (such as hierarchical hidden Markov models) provided significantly better performance. However, such systems are not quite as automatic to build. Just as humans need to learn approximately one conceptual hierarchy at a time, the same is true for computerized systems, so the learning process needs to be carefully managed.
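The flavor of such hierarchical modeling can be suggested with a toy sketch (my own invented example, not Google's or IBM's production approach): a top-level Markov chain moves between phrase states, and each state expands through its own word-level sub-chain, so that structure at one level is built from structure at the level below. Every state, word, and probability here is made up purely for illustration.

```python
import random

random.seed(1)

def sample(pairs):
    """Sample an item from a list of (item, probability) pairs."""
    r, cum = random.random(), 0.0
    for item, p in pairs:
        cum += p
        if r < cum:
            return item
    return pairs[-1][0]

# Word-level sub-chains (toy): each word maps to weighted successors,
# or to [] when the sub-chain terminates.
SUBCHAINS = {
    "GREETING": {"_start": [("hello", 0.6), ("hi", 0.4)],
                 "hello": [("there", 1.0)], "hi": [("there", 1.0)],
                 "there": []},
    "QUESTION": {"_start": [("how", 1.0)],
                 "how": [("are", 1.0)], "are": [("you", 1.0)],
                 "you": []},
}

# Top-level chain over phrase states (toy probabilities).
TOP = {"GREETING": [("QUESTION", 1.0)], "QUESTION": []}

def expand(state):
    """Run one sub-chain from its start state until a terminal word."""
    words, word = [], sample(SUBCHAINS[state]["_start"])
    while True:
        words.append(word)
        nxt = SUBCHAINS[state][word]
        if not nxt:
            return words
        word = sample(nxt)

def generate(state="GREETING"):
    """Walk the top-level chain, expanding each state via its sub-chain."""
    out = []
    while True:
        out += expand(state)
        nxt = TOP[state]
        if not nxt:
            return " ".join(out)
        state = sample(nxt)

print(generate())  # e.g. "hello there how are you"
```

The point of the two levels is the one made above: learning the word-level sub-chains and the phrase-level chain are separable problems, which is why such systems are trained roughly one level of the hierarchy at a time.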
The other insight is that hand-built rules work well for a core of common basic knowledge. For translations of short passages, this approach often provides more accurate results. For example, DARPA has rated rule-based Chinese-to-English translators higher than Google Translate for short passages. For what is called the tail of a language, which refers to the millions of infrequent phrases and concepts used in it, the accuracy of rule-based systems approaches an unacceptably low asymptote. If we plot natural-language-understanding accuracy against the amount of training data analyzed, rule-based systems have higher performance initially but level off at fairly low accuracies of about 70 percent. In sharp contrast, statistical systems can reach the high 90s in accuracy but require a great deal of data to achieve that.
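The trade-off just described can be made concrete with a toy model. The curve shapes and every parameter below are my own assumptions, chosen only to match the qualitative picture of a roughly 70 percent rule-based plateau versus a high-90s statistical ceiling that requires far more data:

```python
import math

def rule_based(n, ceiling=0.70, gap=0.20, scale=1e3):
    """Toy accuracy of a rule-based system after n training examples:
    starts relatively high, saturates quickly near its ~70% ceiling."""
    return ceiling - gap * math.exp(-n / scale)

def statistical(n, ceiling=0.97, gap=0.95, scale=5e4):
    """Toy accuracy of a statistical system: starts near zero but
    approaches a high-90s ceiling given enough data."""
    return ceiling - gap * math.exp(-n / scale)

def crossover(lo=1, hi=10**7):
    """Binary-search the data size at which the statistical system
    overtakes the rule-based one (the curves cross exactly once)."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if statistical(mid) >= rule_based(mid):
            hi = mid
        else:
            lo = mid
    return hi

for n in (0, 10_000, 100_000, 1_000_000):
    print(f"n={n:>9}: rules={rule_based(n):.2f} stats={statistical(n):.2f}")
print("statistical overtakes rule-based near n =", crossover())
```

With these invented parameters the statistical curve overtakes the rule-based one somewhere in the tens of thousands of examples, which is the argument for the hybrid strategy discussed next: rules carry the system until enough usage data accumulates.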
Often we need a combination of at least moderate performance on a small amount of training data and then the opportunity to achieve high accuracies with a more significant quantity. Achieving moderate performance quickly enables us to put a system in the field and then to automatically collect training data as people actually use it. In this way, a great deal of learning can occur at the same time that the system is being used, and its accuracy will improve. The statistical learning needs to be fully hierarchical to reflect the nature of language, which also reflects how the human brain works.
This is also how Siri and Dragon Go! work—using rules for the most common and reliable phenomena and then learning the “tail” of the language in the hands of real users. When the Cyc team realized that they had reached a ceiling of performance based on hand-coded rules, they too adopted this approach. Hand-coded rules provide two essential functions. First, they offer adequate initial accuracy, so that a trial system can be placed into widespread use, where it will improve automatically. Second, they provide a solid basis for the lower levels of the conceptual hierarchy, so that automated learning can proceed to higher conceptual levels.
As mentioned above, Watson represents a particularly impressive example of the approach of combining hand-coded rules with hierarchical statistical learning. IBM combined a number of leading natural-language programs to create a system that could play the natural-language game of Jeopardy! On February 14–16, 2011, Watson competed with the two leading human players: Brad Rutter, who had won more money than anyone else on the quiz show, and Ken Jennings, who had previously held the Jeopardy! championship for the record time of seventy-five days.
By way of context, I had predicted in my first book, The Age of Intelligent Machines, written in the mid-1980s, that a computer would take the world chess championship by 1998. I also predicted that when that happened, we would either downgrade our opinion of human intelligence, upgrade our opinion of machine intelligence, or downplay the importance of chess, and that if history were a guide, we would minimize chess. Both of these things happened in 1997. When IBM’s chess supercomputer Deep Blue defeated the reigning human world chess champion, Garry Kasparov, we were immediately treated to arguments that it was to be expected that a computer would win at chess because computers are logic machines, and chess, after all, is a game of logic. Thus Deep Blue’s victory was judged to be neither surprising nor significant. Many of its critics went on to argue that computers would never master the subtleties of human language, including metaphors, similes, puns, double entendres, and humor.
[Figure: The accuracy of natural-language-understanding systems as a function of the amount of training data. The best approach combines rules for the “core” of the language with a data-driven approach for the “tail” of the language.]
That is at least one reason why Watson represents such a significant milestone: Jeopardy! is precisely such a sophisticated and challenging language task. Typical Jeopardy! queries include many of these vagaries of human language. What is perhaps not evident to many observers is that Watson not only had to master the language in the unexpected and convoluted queries, but for the most part its knowledge was not hand-coded. It obtained that knowledge by actually reading 200 million pages of natural-language documents, including all of Wikipedia and other encyclopedias, comprising 4 trillion bytes of language-based knowledge. As readers of this book are well aware, Wikipedia is not written in LISP or CycL, but rather in natural sentences that have all of the ambiguities and intricacies inherent in language. Watson needed to consider all 4 trillion bytes of its reference material when responding to a question. (I realize that Jeopardy! queries are answers in search of a question, but this is a technicality—they ultimately are really questions.) If Watson can understand and respond to questions based on 200 million pages—in three seconds!—there is nothing to stop similar systems from reading the other billions of documents on the Web. Indeed, that effort is now under way.
When we were developing character and speech recognition systems and early natural-language-understanding systems in the 1970s through 1990s, we used a methodology of incorporating an “expert manager.” We would develop multiple systems to do the same thing but would incorporate somewhat different approaches in each one. Some of the differences were subtle, such as variations in the parameters controlling the mathematics of the learning algorithm. Some variations were fundamental, such as including rule-based systems instead of hierarchical statistical learning systems. The expert manager was itself a software program that was programmed to learn the strengths and weaknesses of these different systems by examining their performance in real-world situations. It was based on the notion that these strengths were orthogonal; that is, one system would tend to be strong where another was weak. Indeed, the overall performance of the combined systems with the trained expert manager in charge was far better than any of the individual systems.
Watson works the same way. Using an architecture called UIMA (Unstructured Information Management Architecture), Watson deploys literally hundreds of different systems (many of the individual language components in Watson are the same ones used in publicly available natural-language-understanding systems), all of which attempt either to come up directly with a response to the Jeopardy! query or at least to provide some disambiguation of it. UIMA basically acts as the expert manager, intelligently combining the results of the independent systems. It goes substantially beyond earlier systems, such as the one we developed in the predecessor company to Nuance, in that its individual systems can contribute to a result without necessarily producing a final answer; it is sufficient if a subsystem helps narrow down the solution. UIMA is also able to compute how much confidence it has in the final answer. The human brain does this also—we are probably very confident of our response when asked for our mother’s first name, but we are less so in coming up with the name of someone we met casually a year ago.
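The expert-manager idea can be sketched in a few lines. This is a deliberately simplified construction of my own, not Nuance's or IBM's actual code: each subsystem returns a candidate answer with a confidence, or abstains; the manager learns a reliability weight per subsystem from labeled history, then scores candidates by weighted confidence and reports an overall confidence, loosely analogous to what UIMA does at vastly larger scale.

```python
from collections import defaultdict

class ExpertManager:
    """Toy expert manager: learns per-subsystem reliability weights,
    then pools weighted confidences across subsystems."""

    def __init__(self, subsystems):
        self.subsystems = subsystems               # name -> callable(query)
        self.weights = {name: 1.0 for name in subsystems}

    def train(self, labeled):
        """Weight each subsystem by its accuracy on (query, answer) pairs."""
        for name, system in self.subsystems.items():
            correct = sum(1 for q, a in labeled if system(q)[0] == a)
            self.weights[name] = correct / len(labeled)

    def answer(self, query):
        """Score candidates by weighted confidence; a subsystem may return
        (None, conf) to abstain rather than force a final answer."""
        scores = defaultdict(float)
        for name, system in self.subsystems.items():
            candidate, conf = system(query)
            if candidate is not None:
                scores[candidate] += self.weights[name] * conf
        if not scores:
            return None, 0.0
        best = max(scores, key=scores.get)
        return best, scores[best] / sum(scores.values())

# Two toy subsystems with orthogonal strengths: one knows a few
# geography facts, one can only add numbers; each abstains elsewhere.
facts = {"capital of France": "Paris", "capital of Peru": "Lima"}
def geo(q):
    return (facts.get(q), 0.9)
def adder(q):
    if "+" in q:
        return (str(sum(int(x) for x in q.split("+"))), 0.8)
    return (None, 0.8)

mgr = ExpertManager({"geo": geo, "adder": adder})
mgr.train([("capital of France", "Paris"), ("2+2", "4")])
print(mgr.answer("capital of Peru"))   # geo's weighted vote wins
print(mgr.answer("3+4"))               # adder's weighted vote wins
```

The abstention path (`None`) mirrors the point above that a subsystem need not produce a final answer to be useful; in a fuller version, a subsystem's partial evidence would merely narrow the candidate set rather than vote directly.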