Despite the convergence of frequentist and Bayesian thinking on some issues, there remains the philosophical difference about the meaning of probability. Viewing it as subjective makes many scientists queasy, but it also enables many otherwise-forbidden uses. If you’re a frequentist, you can only estimate probabilities of events that can occur more than once. So a question like “What is the probability that Hillary Clinton will beat Jeb Bush in the next presidential election?” is unanswerable, because there’s never been an election pitting them against each other. But for a Bayesian, a probability is a subjective degree of belief, so he’s free to make an educated guess, and the inference calculus keeps all his guesses consistent.
The Bayesian method is not just applicable to learning Bayesian networks and their special cases. (Conversely, despite their name, Bayesian networks aren’t necessarily Bayesian: frequentists can learn them, too, as we just saw.) We can put a prior distribution on any class of hypotheses (sets of rules, neural networks, programs) and then update it with the hypotheses’ likelihood given the data. Bayesians’ view is that it’s up to you what representation you choose, but then you have to learn it using Bayes’ theorem. In the 1990s, they mounted a spectacular takeover of the Conference on Neural Information Processing Systems (NIPS for short), the main venue for connectionist research. The ringleaders (so to speak) were David MacKay, Radford Neal, and Michael Jordan. MacKay, a Brit who was a student of John Hopfield’s at Caltech and later became chief scientific advisor to the UK’s Department of Energy and Climate Change, showed how to learn multilayer perceptrons the Bayesian way. Neal introduced the connectionists to MCMC, and Jordan introduced them to variational inference. Finally, they pointed out that in the limit you could “integrate out” the neurons in a multilayer perceptron, leaving a type of Bayesian model that made no reference to them. Before long, the word neural in the title of a paper submitted to NIPS became a good predictor of rejection. Some researchers joked that the conference should change its name to BIPS, for Bayesian Information Processing Systems.
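The recipe of putting a prior on a class of hypotheses and updating it with each hypothesis’s likelihood can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the three coin hypotheses, the uniform prior, and the function name bayes_update are made up for the example, and real hypothesis classes such as rule sets or neural networks are far too large to enumerate this way.

```python
from fractions import Fraction

def bayes_update(prior, likelihood):
    """Return the posterior P(h | data) given P(h) and P(data | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(data), the normalizing constant
    return {h: p / evidence for h, p in unnormalized.items()}

# Hypothetical hypotheses: each one claims a probability of heads for a coin.
heads_prob = {
    "fair": Fraction(1, 2),
    "biased_heads": Fraction(3, 4),
    "biased_tails": Fraction(1, 4),
}

# Assumed prior: no hypothesis is favored before seeing data.
prior = {h: Fraction(1, 3) for h in heads_prob}

# Data: two heads in a row. Likelihood of that data under each hypothesis.
likelihood = {h: p * p for h, p in heads_prob.items()}

posterior = bayes_update(prior, likelihood)
for hypothesis, probability in posterior.items():
    print(hypothesis, probability)
```

After two heads, belief shifts toward the heads-biased coin, but the other hypotheses keep some probability rather than being discarded.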
Markov weighs the evidence
But something funny happened on the way to world domination. Researchers using Bayesian models kept noticing that you got better results by tweaking the probabilities in illegal ways. For example, raising P(words) to some power in speech recognizers improved accuracy, but then it wasn’t Bayes’ theorem any more. What was going on? The culprit, it turns out, was the false independence assumptions that generative models make. The simplified graph structure makes the models learnable and is worth keeping, but then we’re better off just learning the best parameters we can for the task at hand, irrespective of whether they’re probabilities. The real strength of, say, Naïve Bayes is that it provides a small, informative set of features from which to predict the class and a fast, robust way to learn the corresponding parameters. In a spam filter, each feature is the occurrence of a particular word in spam, and the corresponding parameter is how often it occurs; and similarly for nonspam. Viewed in this way, Naïve Bayes can be optimal, in the sense of making the best predictions possible, even in many cases where its independence assumptions are wildly violated. When I realized this and published a paper about it in 1996, people’s suspicion of Naïve Bayes melted away, helping it to take off. But it was also a step on the way to a different kind of model, which in the last two decades has increasingly replaced Bayesian networks in machine learning: Markov networks.
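To make the “features and parameters” view of Naïve Bayes concrete, here is a minimal sketch of a spam scorer. It assumes the multinomial variant with add-one smoothing, which is one common choice and not necessarily the one used in any particular filter; the four training messages and the function names are invented for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(messages):
    """messages: list of (list_of_words, label), label is 'spam' or 'ham'."""
    class_counts = Counter(label for _, label in messages)
    word_counts = {"spam": Counter(), "ham": Counter()}
    for words, label in messages:
        word_counts[label].update(words)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    totals = {c: sum(word_counts[c].values()) for c in word_counts}
    # One weight per word: the log-ratio of its smoothed frequency
    # in spam versus nonspam (add-one smoothing is an assumption here).
    weights = {}
    for w in vocab:
        p_spam = (word_counts["spam"][w] + 1) / (totals["spam"] + len(vocab))
        p_ham = (word_counts["ham"][w] + 1) / (totals["ham"] + len(vocab))
        weights[w] = math.log(p_spam / p_ham)
    # The bias is the log-ratio of the class frequencies.
    bias = math.log(class_counts["spam"] / class_counts["ham"])
    return weights, bias

def spam_score(words, weights, bias):
    """Positive means 'more likely spam'; unseen words are ignored."""
    return bias + sum(weights.get(w, 0.0) for w in words)

# Made-up training data for illustration only.
training = [
    (["cheap", "pills", "now"], "spam"),
    (["cheap", "offer", "now"], "spam"),
    (["meeting", "notes", "now"], "ham"),
    (["lunch", "meeting", "tomorrow"], "ham"),
]
weights, bias = train_naive_bayes(training)
print(spam_score(["cheap", "pills"], weights, bias) > 0)
print(spam_score(["lunch", "meeting"], weights, bias) > 0)
```

Once written this way, the classifier is just a sum of per-word weights, which is why the weights can in principle be tuned directly for accuracy instead of being read off as probabilities.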
A Markov network is a set of features and corresponding weights, which together define a probability distribution. A feature can be as simple as “This is a ballad” or as elaborate as “This is a ballad by a hip-hop artist, with a saxophone riff and a descending chord progression.” Pandora uses a large set of features, which it calls the Music Genome Project, to select songs to play for you. Suppose we plug them into a Markov network. If you like ballads, the weight of the corresponding feature goes up, and you’re more likely to hear ballads when you turn on Pandora. If you also like songs by hip-hop artists, that feature’s weight also goes up. The songs you’re most likely to hear are now ones that have both features, namely ballads by hip-hop artists. If you don’t like ballads or hip-hop artists per se, but only enjoy them in combination, the more elaborate feature “Ballad by a hip-hop artist” is what you need. Pandora’s features are handcrafted, but in Markov networks we can also learn features using hill climbing, similar to rule induction. Either way, gradient descent is a good way to learn the weights.
Like Bayesian networks, Markov networks can be represented by graphs, but they have undirected arcs instead of arrows. Two variables are connected, meaning they depend directly on each other, if they appear together in some feature, like “Ballad” and “By a hip-hop artist” in “Ballad by a hip-hop artist.”
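One standard way to turn features and weights into a probability distribution is the log-linear form: each state’s probability is proportional to the exponential of the summed weights of the features it satisfies. The sketch below assumes that form and uses just two yes/no variables so every state can be enumerated; the feature names and weight values are illustrative, not Pandora’s.

```python
import itertools
import math

# Each feature maps a state (dict of booleans) to True/False.
features = {
    "ballad": lambda s: s["ballad"],
    "hip_hop": lambda s: s["hip_hop"],
    "ballad_by_hip_hop": lambda s: s["ballad"] and s["hip_hop"],
}

def distribution(weights):
    """P(state) = exp(sum of weights of true features) / Z."""
    states = [
        dict(zip(("ballad", "hip_hop"), values))
        for values in itertools.product([False, True], repeat=2)
    ]
    scores = [
        math.exp(sum(weights[name] * f(s) for name, f in features.items()))
        for s in states
    ]
    z = sum(scores)  # normalizing constant, often called the partition function
    return {
        (s["ballad"], s["hip_hop"]): score / z
        for s, score in zip(states, scores)
    }

# With all weights at zero, every kind of song is equally likely.
uniform = distribution({"ballad": 0.0, "hip_hop": 0.0, "ballad_by_hip_hop": 0.0})

# A listener who only likes the combination: only the joint feature has weight.
combo = distribution({"ballad": 0.0, "hip_hop": 0.0, "ballad_by_hip_hop": 1.0})
print(combo[(True, True)], combo[(True, False)])
```

Because the joint feature mentions both variables, they would be linked by an arc in the network’s graph; with only the two single-variable features, they would be independent.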
Markov networks are a staple in many areas, such as computer vision. For instance, a driverless car needs to segment each image it sees into road, sky, and countryside. One option is to label each pixel as one of the three according to its color, but this is not nearly good enough. Images are very noisy and variable, and the car will hallucinate rocks strewn all over the roadway and patches of road in the sky. We know, however, that nearby pixels in an image are usually part of the same object, and we can introduce a corresponding set of features: for each pair of neighboring pixels, the feature is true if they belong to the same object, and false otherwise. Now images with large, contiguous blocks of road and sky are much more likely than images without, and the car goes straight instead of continually swerving left and right to avoid imaginary rocks.
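The effect of the neighboring-pixel features can be seen on a toy example. The sketch below assumes a single row of five pixels, two labels, invented per-pixel color scores, and one shared weight for the “neighbors agree” feature; it finds the best labeling by brute force, which only works for tiny inputs, whereas real vision systems use approximate inference.

```python
import itertools

LABELS = ("road", "sky")

def labeling_score(labels, color_scores, smoothness):
    """Sum of per-pixel color evidence plus a reward for agreeing neighbors."""
    unary = sum(color_scores[i][label] for i, label in enumerate(labels))
    agreements = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return unary + smoothness * agreements

def best_labeling(color_scores, smoothness):
    """Try every labeling and keep the highest-scoring one (brute force)."""
    candidates = itertools.product(LABELS, repeat=len(color_scores))
    return max(
        candidates,
        key=lambda labels: labeling_score(labels, color_scores, smoothness),
    )

# Made-up color evidence: the middle pixel is noisy and looks a bit like sky.
color_scores = [
    {"road": 2.0, "sky": 0.0},
    {"road": 2.0, "sky": 0.0},
    {"road": 0.0, "sky": 1.0},
    {"road": 2.0, "sky": 0.0},
    {"road": 2.0, "sky": 0.0},
]

print(best_labeling(color_scores, smoothness=0.0))  # color alone
print(best_labeling(color_scores, smoothness=1.0))  # color plus neighbor features
```

With the smoothness weight at zero the noisy pixel is labeled sky, a patch of sky in the middle of the road; with the weight at one, agreeing with its neighbors outweighs the weak color evidence and the whole row is labeled road.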
Markov networks can be trained to maximize either the likelihood of the whole data or the conditional likelihood of what we want to predict given what we know. For Siri, the likelihood of the whole data is P(words, sounds), and the conditional likelihood we’re interested in is P(words | sounds). By optimizing the latter, we can ignore P(sounds), which is only a distraction from our goal. And since we ignore it, it can be arbitrarily complex. This is much better than HMMs’ unrealistic assumption that sounds depend solely on the corresponding words, without any influence from the surroundings. In fact, if all Siri cares about is figuring out which words you just spoke, perhaps it doesn’t even need to worry about probabilities; it just needs to make sure the correct words score higher than incorrect ones when it tots up the weights of their features, ideally a lot higher, just to be safe.
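Both views in this paragraph can be written down for a single utterance. In the sketch below, the candidate transcriptions and their scores are invented; each score stands for the summed weights of the features that a candidate shares with the sounds. The first function normalizes over candidate word sequences only, so P(sounds) never has to be modeled; the second drops probabilities altogether and just measures how far ahead the correct answer is.

```python
import math

def conditional_probs(candidate_scores):
    """P(words | sounds): normalize exp(score) over the candidates only."""
    top = max(candidate_scores.values())  # subtract the max for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in candidate_scores.items()}
    z = sum(exp_scores.values())
    return {c: e / z for c, e in exp_scores.items()}

def margin(candidate_scores, correct):
    """How much the correct candidate outscores the best incorrect one."""
    best_wrong = max(s for c, s in candidate_scores.items() if c != correct)
    return candidate_scores[correct] - best_wrong

# Hypothetical summed feature weights for three candidate transcriptions.
scores = {
    "recognize speech": 4.0,
    "wreck a nice beach": 2.5,
    "recognise peach": 1.0,
}

probs = conditional_probs(scores)
print(max(probs, key=probs.get))
print(margin(scores, correct="recognize speech"))
```

Training for conditional likelihood pushes the first quantity up for the correct words; training for a large margin pushes the second one up, which is the “a lot higher, just to be safe” idea.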
Analogizers took this line of reasoning to its logical conclusion, as we’ll see in the next chapter. In the first decade of the new millennium, they in turn took over NIPS. Now the connectionists dominate once more, under the banner of deep learning. Some say that research goes in cycles, but it’s more like a spiral, with loops winding around the direction of progress. In machine learning, the spiral converges to the Master Algorithm.
Logic and probability: The star-crossed couple
You’d think that Bayesians and symbolists would get along great, given that they both believe in a first-principles approach to learning, rather than a nature-inspired one. Far from it. Symbolists don’t like probabilities and tell jokes like “How many Bayesians does it take to change a lightbulb? They’re not sure. Come to think of it, they’re not sure the lightbulb is burned out.” More seriously, symbolists point to the high price we pay for probability. Inference suddenly becomes a lot more expensive, all those numbers are hard to understand, we have to deal with priors, and hordes of zombie hypotheses chase us around forever. The ability to compose pieces of knowledge on the fly, so dear to symbolists, is gone. Worst of all, we don’t know how to put probability distributions on many of the things we need to learn. A Bayesian network is a distribution over a vector of variables, but what about distributions over networks, databases, knowledge bases, languages, plans, and computer programs, to name a few? All of these are easily handled in logic, and an algorithm that can’t learn them is clearly not the Master Algorithm.