One does not need to be an AI expert to be moved by the performance of Watson on Jeopardy! Although I have a reasonable understanding of the methodology used in a number of its key subsystems, that does not diminish my emotional reaction to watching it—him?—perform. Even a perfect understanding of how all of its component systems work—which no one actually has—would not help you to predict how Watson would actually react to a given situation. It contains hundreds of interacting subsystems, and each of these is considering millions of competing hypotheses at the same time, so predicting the outcome is impossible. Doing a thorough analysis—after the fact—of Watson’s deliberations for a single three-second query would take a human centuries.
To continue my own history, in the late 1980s and 1990s we began working on natural-language understanding in limited domains. You could speak to one of our products, called Kurzweil Voice, about anything you wanted, so long as it had to do with editing documents. (For example, “Move the third paragraph on the previous page to here.”) It worked pretty well in this limited but useful domain. We also created systems with medical domain knowledge so that doctors could dictate patient reports. They had enough knowledge of fields such as radiology and pathology that they could question the doctor if something in the report seemed unclear, and would guide the physician through the reporting process. These medical reporting systems have evolved into a billion-dollar business at Nuance.
Understanding natural language, especially as an extension to automatic speech recognition, has now entered the mainstream. As of the writing of this book, Siri, the automated personal assistant on the iPhone 4S, has created a stir in the mobile computing world. You can pretty much ask Siri to do anything that a self-respecting smartphone should be capable of doing (for example, “Where can I get some Indian food around here?” or “Text my wife that I’m on my way,” or “What do people think of the new Brad Pitt movie?”), and most of the time Siri will comply. Siri will entertain a small amount of nonproductive chatter. If you ask her what the meaning of life is, she will respond with “42,” which fans of The Hitchhiker’s Guide to the Galaxy will recognize as its “answer to the ultimate question of life, the universe, and everything.” Knowledge questions (including the one about the meaning of life) are answered by Wolfram Alpha, described on page 170. There is a whole world of “chatbots” that do nothing but engage in small talk. If you would like to talk to our chatbot named Ramona, go to our Web site KurzweilAI.net and click on “Chat with Ramona.”
Some people have complained to me about Siri’s failure to answer certain requests, but these are often the same people who persistently complain about human service providers as well. I sometimes suggest that we try it together, and it often works better than they expect. The complaints remind me of the story of the dog who plays chess. To an incredulous questioner, the dog’s owner replies, “Yeah, it’s true, he does play chess, but his endgame is weak.” Effective competitors are now emerging, such as Google Voice Search.
That the general public is now having conversations in natural spoken language with their handheld computers marks a new era. It is typical that people dismiss the significance of a first-generation technology because of its limitations. A few years later, when the technology does work well, people still dismiss its importance because, well, it’s no longer new. That being said, Siri works impressively for a first-generation product, and it is clear that this category of product is only going to get better.
Siri uses the HMM-based speech recognition technologies from Nuance. The natural-language extensions were first developed by the DARPA-funded “CALO” project.15 Siri has been enhanced with Nuance’s own natural-language technologies, and Nuance offers a very similar technology called Dragon Go!16
The methods used for understanding natural language are very similar to hierarchical hidden Markov models, and indeed the HHMM itself is commonly used. Although some of these systems are not specifically labeled as using HMMs or HHMMs, the mathematics is virtually identical. They all involve hierarchies of linear sequences in which each element has a weight, connections that are self-adapting, and an overall system that self-organizes based on learning data. Usually the learning continues during actual use of the system. This approach matches the hierarchical structure of natural language—it is just a natural extension up the conceptual ladder from parts of speech to words to phrases to semantic structures. It would make sense to run a genetic algorithm on the parameters that control the precise learning algorithm of this class of hierarchical learning systems and determine the optimal algorithmic details.
Over the past decade there has been a shift in the way that these hierarchical structures are created. In 1984 Douglas Lenat (born in 1950) started the ambitious Cyc (for enCYClopedic) project, which aimed to create rules that would codify everyday “commonsense” knowledge. The rules were organized in a huge hierarchy, and each rule involved—again—a linear sequence of states. For example, one Cyc rule might state that a dog has a face. Cyc can then link to general rules about the structure of faces: that a face has two eyes, a nose, and a mouth, and so on. We don’t need to have one set of rules for a dog’s face and then another for a cat’s face, though we may of course want to put in additional rules for ways in which dogs’ faces differ from cats’ faces. The system also includes an inference engine: If we have rules that state that a cocker spaniel is a dog, that dogs are animals, and that animals eat food, and if we were to ask the inference engine whether cocker spaniels eat, the system would respond that yes, cocker spaniels eat food. Over the next twenty years, and with thousands of person-years of effort, over a million such rules were written and tested. Interestingly, the language for writing Cyc rules—called CycL—is almost identical to LISP.
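The cocker spaniel chain of reasoning can be sketched as a toy inference engine. This is only an illustration of property inheritance up an is-a hierarchy; the real Cyc holds over a million rules written in CycL, not the handful of Python dictionary entries invented here.

```python
# A toy sketch of a Cyc-style rule hierarchy: "is-a" links form the
# hierarchy, properties attach at the most general level that holds
# them, and the inference step walks up the chain.

IS_A = {                       # "X is a kind of Y" links
    "cocker spaniel": "dog",
    "dog": "animal",
}
PROPERTIES = {                 # properties stated once, at the right level
    "dog": ["has a face"],
    "animal": ["eats food"],
}

def properties_of(concept):
    """Collect all properties a concept inherits up the is-a hierarchy."""
    props = []
    while concept is not None:
        props.extend(PROPERTIES.get(concept, []))
        concept = IS_A.get(concept)  # step to the parent concept
    return props

print(properties_of("cocker spaniel"))
# → ['has a face', 'eats food']
```

Because “has a face” is stated once for dogs, every kind of dog inherits it; a separate rule is needed only where a cat’s face differs from a dog’s.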
Meanwhile, an opposing school of thought believed that the best approach to natural-language understanding, and to creating intelligent systems in general, was through automated learning from exposure to a very large number of instances of the phenomena the system was trying to master. A powerful example of such a system is Google Translate, which can translate to and from fifty languages. That’s 50 × 49 = 2,450 different translation directions, although for most language pairs, rather than translate language 1 directly into language 2, it will translate language 1 into English and then English into language 2. That reduces the number of translators Google needed to build to ninety-eight: forty-nine into English and forty-nine out of it (plus a limited number of non-English pairs for which there is direct translation). The Google translators do not use grammatical rules; rather, they create vast databases of common translations for each language pair, based on large “Rosetta stone” corpora of documents translated between the two languages. For the six official languages of the United Nations, Google has used United Nations documents, as they are published in all six languages. For less common languages, other sources have been used.
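The pivot-through-English scheme can be illustrated with a few invented phrase entries; the point is the arithmetic: translators to and from English alone, 2 × (N − 1) of them, suffice to cover all N × (N − 1) directions. The word tables below are made up for the example and have nothing to do with Google's actual phrase databases.

```python
# Illustrative pivot translation: only translators to and from English
# exist, yet any language can reach any other. The tiny word tables
# are invented for this sketch.

TO_EN = {
    ("fr", "bonjour"): "hello",
    ("es", "hola"): "hello",
}
FROM_EN = {
    ("fr", "hello"): "bonjour",
    ("es", "hello"): "hola",
}

def translate(word, src, dst):
    """Translate src -> dst by pivoting through English."""
    english = word if src == "en" else TO_EN[(src, word)]
    return english if dst == "en" else FROM_EN[(dst, english)]

print(translate("bonjour", "fr", "es"))   # French -> Spanish via English
# → hola

# The translator-count arithmetic for N = 50 languages:
N = 50
print(N * (N - 1), 2 * (N - 1))
# → 2450 98
```

Quality does suffer slightly on pivoted pairs, since errors in each leg compound, which is one reason direct translators exist for a few heavily used non-English pairs.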