Indeed, Alpha consists of 15 million lines of Mathematica code. What Alpha is doing is literally computing the answer from approximately 10 trillion bytes of data that have been carefully curated by the Wolfram Research staff. You can ask a wide range of factual questions, such as “What country has the highest GDP per person?” (Answer: Monaco, with $212,000 per person in U.S. dollars), or “How old is Stephen Wolfram?” (Answer: 52 years, 9 months, 2 days as of the day I am writing this). As mentioned, Alpha is used as part of Apple’s Siri; if you ask Siri a factual question, it is handed off to Alpha to handle. Alpha also handles some of the searches posed to Microsoft’s Bing search engine.
In a recent blog post, Dr. Wolfram reported that Alpha is now providing successful responses 90 percent of the time.17 He also reported an exponential decrease in the failure rate, with a half-life of around eighteen months. It is an impressive system that uses handcrafted methods and hand-checked data. It is a testament to why we created computers in the first place. As we discover and compile scientific and mathematical methods, computers are far better than unaided human intelligence at implementing them. Most of the known scientific methods have been encoded in Alpha, along with continually updated data on topics ranging from economics to physics. In a private conversation I had with Dr. Wolfram, he estimated that self-organizing methods such as those used in Watson typically achieve about 80 percent accuracy when they are working well. Alpha, he pointed out, is achieving about 90 percent accuracy. Of course, there is self-selection in both of these accuracy figures, in that users (such as myself) have learned what kinds of questions Alpha is good at, and a similar factor applies to the self-organizing methods. Eighty percent appears to be a reasonable estimate of how accurate Watson is on Jeopardy! queries, but this was sufficient to defeat the best humans.
It is my view that self-organizing methods such as those I articulated in the pattern recognition theory of mind (PRTM) are needed to understand the elaborate and often ambiguous hierarchies we encounter in real-world phenomena, including human language. An ideal combination for a robustly intelligent system would be to combine hierarchical intelligence based on the PRTM (which I contend is how the human brain works) with the precise codification of scientific knowledge and data. That essentially describes a human with a computer. We will enhance both poles of intelligence in the years ahead. With regard to our biological intelligence, although our neocortex has significant plasticity, its basic architecture is limited by its physical constraints. Putting additional neocortex into our foreheads was an important evolutionary innovation, but we cannot now easily expand the size of our frontal lobes by a factor of a thousand, or even by 10 percent. That is, we cannot do so biologically, but that is exactly what we will do technologically.
A Strategy for Creating a Mind
There are billions of neurons in our brains, but what are neurons? Just cells. The brain has no knowledge until connections are made between neurons. All that we know, all that we are, comes from the way our neurons are connected.
Let’s use the observations I have discussed above to begin building a brain. We will start by building a pattern recognizer that has the necessary attributes. Next we’ll make as many copies of the recognizer as we have memory and computational resources to support. Each recognizer computes the probability that its pattern has been recognized. In doing so, it takes into consideration the observed magnitude of each input (in some appropriate continuum) and matches it against the learned size and size variability parameters associated with that input. The recognizer triggers its simulated axon if that computed probability exceeds a threshold. This threshold and the parameters that control the computation of the pattern’s probability are among the parameters we will optimize with a genetic algorithm. Because it is not a requirement that every input be active for a pattern to be recognized, the system provides for autoassociative recognition (that is, recognizing a pattern based on only part of the pattern being present). We also allow for inhibitory signals (signals indicating that the pattern is less likely to be present).
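To make this concrete, here is a minimal Python sketch of one such recognizer. It illustrates the scheme just described rather than any actual implementation: the Gaussian-style match between an observed magnitude and the learned expected size and variability, the per-input weights, and the 0.6 threshold are all assumptions chosen for the example.

```python
import math

class InputSlot:
    """One input to a recognizer: the learned expected magnitude of that
    input, its expected variability, a weight, and an inhibitory flag."""
    def __init__(self, expected_size, size_variability, weight=1.0, inhibitory=False):
        self.expected_size = expected_size
        self.size_variability = size_variability
        self.weight = weight
        self.inhibitory = inhibitory

class PatternRecognizer:
    """Computes the probability that its pattern is present and "fires"
    (triggers its simulated axon) when that probability exceeds a threshold."""
    def __init__(self, slots, threshold=0.6):
        self.slots = slots
        self.threshold = threshold  # a parameter a genetic algorithm could tune

    def match(self, observations):
        """observations: the observed magnitude of each input, or None if absent.
        Missing inputs contribute nothing rather than vetoing recognition, so a
        partial pattern can still fire -- the autoassociative property."""
        score = 0.0
        total_weight = sum(slot.weight for slot in self.slots)
        for slot, observed in zip(self.slots, observations):
            if observed is None:
                continue  # absent input: no evidence either way
            # Gaussian-style closeness of the observed to the expected magnitude
            z = (observed - slot.expected_size) / slot.size_variability
            closeness = math.exp(-0.5 * z * z)
            # Inhibitory inputs count as evidence *against* the pattern
            score += -slot.weight * closeness if slot.inhibitory else slot.weight * closeness
        probability = max(0.0, score / total_weight)
        return probability, probability >= self.threshold

# A recognizer can fire on a partial pattern: only two of three inputs are present here.
r = PatternRecognizer([InputSlot(1.0, 0.2), InputSlot(2.0, 0.5), InputSlot(0.5, 0.1)])
print(r.match([1.05, None, 0.52]))  # probability ~0.65, above the 0.6 threshold
```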
Recognition of the pattern sends an active signal up the simulated axon of this pattern recognizer. This axon is in turn connected to one or more pattern recognizers at the next higher conceptual level, each of which accepts this pattern as one of its inputs. Each pattern recognizer also sends signals down to pattern recognizers at lower conceptual levels whenever most of a pattern has been recognized, indicating that the rest of the pattern is “expected.” Each pattern recognizer has one or more of these expected-signal input channels. When an expected signal is received in this way, the recognizer’s threshold is lowered, making recognition easier.
The pattern recognizers are responsible for “wiring” themselves to other pattern recognizers up and down the conceptual hierarchy. Note that all the “wires” in a software implementation are virtual links (which, like Web links, are essentially memory pointers), not actual wires. This scheme is actually much more flexible than its counterpart in the biological brain. In a human brain, new patterns have to be assigned to an actual physical pattern recognizer, and new connections have to be made with an actual axon-to-dendrite link. Usually this means taking an existing physical connection that is approximately what is needed and then growing the necessary axon and dendrite extensions to complete the full connection.
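A sketch of how this wiring might look in software, in the same illustrative Python as above: each “wire” is an ordinary object reference (a memory pointer), the upward signal is a method call on a parent, and the downward “expected” signal lowers a child’s threshold. The 0.8 reduction factor and the “most of the pattern” cutoff of 0.5 are invented for the illustration.

```python
class HierarchicalRecognizer:
    """A recognizer whose "wires" are plain object references linking it to
    parents one conceptual level up and children one conceptual level down."""
    def __init__(self, name, threshold=0.6):
        self.name = name
        self.base_threshold = threshold
        self.threshold = threshold
        self.parents = []
        self.children = []
        self.active_children = set()

    def connect_to(self, parent):
        # "Wiring" is just pointer assignment -- no axon or dendrite to grow,
        # and rewiring later is equally cheap.
        self.parents.append(parent)
        parent.children.append(self)

    def receive_expectation(self):
        # A downward "expected" signal makes this pattern easier to recognize.
        self.threshold = self.base_threshold * 0.8

    def receive_input(self, child):
        """Called by a child firing its simulated axon up the hierarchy."""
        self.active_children.add(child.name)
        fraction = len(self.active_children) / len(self.children)
        if fraction >= self.threshold:
            self.fire()                      # pattern recognized: signal upward
        elif fraction >= 0.5:                # most of the pattern seen:
            for child in self.children:      # tell children the rest is expected
                child.receive_expectation()

    def fire(self):
        for parent in self.parents:
            parent.receive_input(self)
```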
Another technique used in biological mammalian brains is to start with a large number of possible connections and then prune the neural connections that are not used. If a biological neocortex reassigns cortical pattern recognizers that have already learned older patterns in order to learn more recent material, the connections need to be physically reconfigured. Again, these tasks are much simpler in a software implementation: we simply assign new memory locations to a new pattern recognizer and use memory links for the connections. If the digital neocortex wishes to reassign cortical memory resources from one set of patterns to another, it simply frees the memory used by the old pattern recognizers and then makes the new assignment. This sort of “garbage collection” and reassignment of memory is a standard feature of many software architectures. In our digital brain we would also back up old memories before discarding them from the active neocortex, a precaution we cannot take in our biological brains.
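The paragraph above suggests a simple memory-management sketch, continuing the same illustrative Python. The least-recently-used reassignment policy and the archive list are my assumptions; the point is only that freeing, reassigning, and backing up recognizers are ordinary pointer and storage operations in software.

```python
import time

class DigitalNeocortex:
    """A fixed-capacity pool of pattern recognizers with reassignment and
    backup -- trivial in software, costly or impossible biologically."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.patterns = {}   # pattern name -> {"params": ..., "last_used": ...}
        self.archive = []    # backed-up old memories (no biological equivalent)

    def assign(self, name, parameters):
        if len(self.patterns) >= self.capacity:
            # Reassign: back up the least recently used pattern, then free it.
            oldest = min(self.patterns, key=lambda n: self.patterns[n]["last_used"])
            self.archive.append((oldest, self.patterns.pop(oldest)))
        self.patterns[name] = {"params": parameters, "last_used": time.time()}

    def recall(self, name):
        self.patterns[name]["last_used"] = time.time()
        return self.patterns[name]["params"]
```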
There are a variety of mathematical techniques that can be employed to implement this approach to self-organizing hierarchical pattern recognition. The method I would use is hierarchical hidden Markov models (HHMMs), for several reasons. From my personal perspective, I have several decades of familiarity with this method, having used it in the earliest speech recognition and natural-language systems starting in the 1980s. From the perspective of the overall field, there is more experience with hidden Markov models than with any other approach to pattern recognition tasks. They are also used extensively in natural-language understanding, and many NLU systems employ techniques that are at least mathematically similar to HHMMs.
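For readers unfamiliar with the technique, here is a minimal sketch of the forward algorithm for a plain (non-hierarchical) hidden Markov model, the building block that an HHMM stacks into levels, where each state of one model can itself be a model at the level below. The two-state “phoneme” example and all of its probabilities are toy numbers invented for the illustration.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) by summing over all hidden-state paths."""
    # Initialize with the start probabilities and the first observation.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # Propagate: sum over every predecessor state, then emit.
        alpha = {
            s: sum(alpha[prev] * trans_p[prev][s] for prev in states) * emit_p[s][obs]
            for s in states
        }
    return sum(alpha.values())

# Toy example: two phoneme-like hidden states emitting acoustic labels "a"/"b".
states  = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p  = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
print(forward(["a", "b", "b"], states, start_p, trans_p, emit_p))
```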
Note that not all hidden Markov model systems are fully hierarchical. Some allow for just a few levels of hierarchy—for example, going from acoustic states to phonemes to words. To build a brain, we will want to enable our system to create as many new levels of hierarchy as needed. Also, most hidden Markov model systems are not fully self-organizing. Some have fixed connections, although these systems do effectively prune many of their starting connections by allowing them to evolve zero connection weights. Our systems from the 1980s and 1990s automatically pruned connections with connection weights below a certain level and also allowed for making new connections to better model the training data and to learn on the fly. A key requirement, I believe, is to allow for the system to flexibly create its own topologies based on the patterns it is exposed to while learning. We can use the mathematical technique of linear programming to optimally assign connections to new pattern recognizers.
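A sketch of the pruning step described above, again in Python: transitions whose learned weight falls below a cutoff are dropped and the survivors renormalized, so the effective topology emerges from training rather than being fixed in advance. The 0.01 cutoff is an arbitrary illustrative value, and this sketch leaves out the linear-programming assignment of new connections.

```python
def prune_connections(trans_p, cutoff=0.01):
    """Drop transitions whose trained weight fell below `cutoff`,
    then renormalize the remaining weights for each state."""
    pruned = {}
    for state, targets in trans_p.items():
        kept = {t: w for t, w in targets.items() if w >= cutoff}
        total = sum(kept.values()) or 1.0   # guard against pruning everything
        pruned[state] = {t: w / total for t, w in kept.items()}
    return pruned

# Connections that evolved near-zero weights are effectively removed:
print(prune_connections({"s1": {"s1": 0.995, "s2": 0.005}}))
# -> {'s1': {'s1': 1.0}}
```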