
Two images from the cover of the book Perceptrons by Marvin Minsky and Seymour Papert. The top image is not connected (that is, the dark area consists of two disconnected parts). The bottom image is connected. A human can readily determine this, as can a simple software program. A feedforward Perceptron such as Frank Rosenblatt’s Mark 1 Perceptron cannot make this determination.

Perceptrons, however, was widely interpreted to imply more than it actually did. Minsky and Papert’s theorem applied only to a particular type of neural net called a feedforward neural net (a category that does include Rosenblatt’s Perceptron); other types of neural nets did not have this limitation. Still, the book managed to kill most funding for neural net research during the 1970s. The field returned in the 1980s with attempts to use what were claimed to be more realistic models of biological neurons, models that also avoided the limitations implied by the Minsky-Papert Perceptron theorem. Nevertheless, the ability of the neocortex to solve the invariance problem, a key to its strength, remained elusive for the resurgent connectionist field.

Sparse Coding: Vector Quantization

In the early 1980s I started a project devoted to another classical pattern recognition problem: understanding human speech. At first, we used traditional AI approaches by directly programming expert knowledge about the fundamental units of speech—phonemes—and rules from linguists on how people string phonemes together to form words and phrases. Each phoneme has distinctive frequency patterns. For example, we knew that vowels such as “e” and “ah” are characterized by certain resonant frequencies called formants, with a characteristic ratio of formants for each phoneme. Sibilant sounds such as “z” and “s” are characterized by a burst of noise that spans many frequencies.

We captured speech as a waveform, which we then converted into multiple frequency bands (perceived as pitches) using a bank of frequency filters. The result of this transformation could be visualized and was called a spectrogram (see page 136).

The filter bank is copying what the human cochlea does, which is the initial step in our biological processing of sound. The software first identified phonemes based on distinguishing patterns of frequencies and then identified words based on identifying characteristic sequences of phonemes.

A spectrogram of three vowels. From left to right: [i] as in “appreciate,” [u] as in “acoustic,” and [a] as in “ah.” The Y axis represents frequency of sound. The darker the band the more acoustic energy there is at that frequency.

A spectrogram of a person saying the word “hide.” The horizontal lines show the formants, which are sustained frequencies that have especially high energy.10

The result was partially successful. We could train our device to learn the patterns for a particular person using a moderate-sized vocabulary, measured in thousands of words. When we attempted to recognize tens of thousands of words, handle multiple speakers, and allow fully continuous speech (that is, speech with no pauses between words), we ran into the invariance problem. Different people enunciated the same phoneme differently—for example, one person’s “e” phoneme may sound like someone else’s “ah.” Even the same person was inconsistent in the way she spoke a particular phoneme. The pattern of a phoneme was often affected by other phonemes nearby. Many phonemes were left out completely. The pronunciation of words (that is, how phonemes are strung together to form words) was also highly variable and dependent on context. The linguistic rules we had programmed were breaking down and could not keep up with the extreme variability of spoken language.

It became clear to me at the time that the essence of human pattern and conceptual recognition was based on hierarchies. This is certainly apparent for human language, which constitutes an elaborate hierarchy of structures. But what is the element at the base of the structures? That was the first question I considered as I looked for ways to automatically recognize fully normal human speech.

Sound enters the ear as a vibration of the air and is converted by the approximately 3,000 inner hair cells in the cochlea into multiple frequency bands. Each hair cell is tuned to a particular frequency (note that we perceive frequencies as tones) and each acts as a frequency filter, emitting a signal whenever there is sound at or near its resonant frequency. As it leaves the human cochlea, sound is thereby represented by approximately 3,000 separate signals, each one signifying the time-varying intensity of a narrow band of frequencies (with substantial overlap among these bands).

Even though it was apparent that the brain was massively parallel, it seemed impossible to me that it was doing pattern matching on 3,000 separate auditory signals. I doubted that evolution could have been that inefficient. We now know that very substantial data reduction does indeed take place in the auditory nerve before sound signals ever reach the neocortex.

In our software-based speech recognizers, we also used filters implemented as software, sixteen to be exact (which we later increased to thirty-two, as we found there was not much benefit to going much higher than this). So in our system, each point in time was represented by sixteen numbers. We needed to reduce these sixteen streams of data into one while at the same time emphasizing the features that are significant in recognizing speech.
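To make the idea of "sixteen numbers per point in time" concrete, here is a minimal sketch in Python. It is not the original system: the function name `band_energies`, the frame length, the hop size, the Hann window, and the linearly spaced bands are all assumptions made for illustration, and a plain discrete Fourier transform stands in for whatever filters were actually used.

```python
import cmath
import math

def band_energies(signal, sample_rate, n_bands=16, frame_len=64, hop=32):
    """Return one list of n_bands energies per time frame (hypothetical helper)."""
    nyquist = sample_rate / 2.0
    n_bins = frame_len // 2 + 1
    # Periodic Hann window to limit leakage between neighbouring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        bands = [0.0] * n_bands
        for k in range(n_bins):
            # Naive DFT coefficient for bin k (slow, but dependency-free).
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(chunk))
            freq = k * sample_rate / frame_len
            # Assumption: bands are equally wide from 0 Hz up to the Nyquist frequency.
            band = min(int(freq * n_bands / nyquist), n_bands - 1)
            bands[band] += abs(coeff) ** 2
        frames.append(bands)
    return frames
```

Each entry of the returned list is one sixteen-number vector, so a recording becomes a sequence of such vectors, one per frame. A real recognizer would more likely space its bands nonlinearly, as the cochlea does, but the shape of the output is the same.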

We used a mathematically optimal technique to accomplish this, called vector quantization. Consider that at any particular point in time, sound (at least from one ear) was represented in our software by sixteen different numbers: that is, the outputs of the sixteen frequency filters. (In the human auditory system the figure would be 3,000, representing the outputs of the 3,000 cochlear inner hair cells.) In mathematical terminology, each such set of numbers (whether 3,000 in the biological case or 16 in our software implementation) is called a vector.

For simplicity, let’s consider the process of vector quantization with vectors of two numbers. Each vector can be considered a point in two-dimensional space.

If we have a very large sample of such vectors and plot them, we are likely to notice clusters forming.

In order to identify the clusters, we need to decide how many we will allow. In our project we generally allowed 1,024 clusters so that we could number them and assign each cluster a 10-bit label (because 2^10 = 1,024). Our sample of vectors represents the diversity that we expect. We tentatively assign the first 1,024 vectors to be one-point clusters. We then consider the 1,025th vector and find the point that it is closest to. If that distance is greater than the smallest distance between any pair of the 1,024 points, we consider it as the beginning of a new cluster. We then collapse the two (one-point) clusters that are closest together into a single cluster. We are thus still left with 1,024 clusters. After processing the 1,025th vector, one of those clusters now has more than one point. We keep processing points in this way, always maintaining 1,024 clusters. After we have processed all the points, we represent each multipoint cluster by the geometric center of the points in that cluster.
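The procedure just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original implementation. The names `build_codebook`, `centroid`, and `quantize` are hypothetical. The text does not say how to measure distance to a cluster that already holds several points, so the sketch uses the cluster's geometric center. It also does not say what happens when the new vector is not far enough away to start its own cluster, so the sketch assumes the vector simply joins its nearest cluster.

```python
import math
from itertools import combinations

def centroid(points):
    """Geometric center of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def build_codebook(vectors, k):
    """Cluster `vectors` into k clusters (k >= 2) and return their centers."""
    # The first k vectors start out as one-point clusters.
    clusters = [[tuple(v)] for v in vectors[:k]]
    for v in map(tuple, vectors[k:]):
        centers = [centroid(c) for c in clusters]
        nearest = min(range(len(centers)),
                      key=lambda i: math.dist(v, centers[i]))
        # Indices (a < b) of the two clusters whose centers are closest together.
        a, b = min(combinations(range(len(centers)), 2),
                   key=lambda ij: math.dist(centers[ij[0]], centers[ij[1]]))
        if math.dist(v, centers[nearest]) > math.dist(centers[a], centers[b]):
            # v begins a new cluster; merge the two closest to stay at k clusters.
            clusters[a].extend(clusters[b])
            del clusters[b]
            clusters.append([v])
        else:
            # Assumption: otherwise v joins its nearest cluster.
            clusters[nearest].append(v)
    return [centroid(c) for c in clusters]

def quantize(vector, codebook):
    """Return the label (index) of the codebook entry closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(vector, codebook[i]))
```

With k = 1,024 the label returned by `quantize` fits in 10 bits, which is how a sixteen-number vector can be reduced to a single number. The pairwise search makes this sketch slow for large k; it is meant to show the logic on small examples, not to be efficient.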