It is not just the history of humans that has benefited from technological advances that can create vast amounts of data. The particle collisions which happen at the LHC generate so much data on the fundamental particles that make up the universe that only a fraction of the data – the fraction that looks like it might reveal exciting new science – is stored, and the rest is discarded. Space telescopes like the recently launched James Webb Space Telescope also produce vast amounts of data. The orbiting telescope sends back to Earth large numbers of high-quality images of galaxies billions of light years – trillions upon trillions of miles – away. You, too, create lots of data, as your phone communicates with phone masts, and as it records your steps or the number of stairs you climb. Your social media posts, and the posts you look at, generate networks of views, and the transactions you make with your credit and debit cards are all creating data. It is estimated that humanity generates 1.145 trillion megabytes of data every day. To give you an idea of quite how much information that is, I can save the 130,000 or so words that constitute this book in just under half a megabyte.

The trend for ever more data will likely continue, with humanity doubling the amount of data it stores every eighteen months to two years. Back in 1945, the doubling time was twenty-five years. No one can keep track of all these new data being produced – data is plural, datum is singular – and analysing even a fraction of them is a challenge that is being met via the emergence of a new field of science: data science. If you can wrangle and analyse data, you will be in demand in jobs as diverse as banking, political influencing and gaming. As the range of equipment available to scientists to make ever more detailed observations increases, so too will the rate of data generation, and quite possibly human knowledge too.

To collect data, you need something to observe and measure, whether that material is a DNA sequence, the debris from particles colliding at the LHC or a 66-million-year-old T. rex fossil. There are more modern-day things to measure than there are objects from the past. We know more about the life of Thales of Miletus than we do about woman X, in part because woman X lived thousands of years before Thales was born. She was born before the first cities, and before writing was invented and records were kept. But detailed analyses of information written in her DNA have provided us with a picture of part of the history of humans. Thanks to our ability to extract genetic samples from woman X’s finger and to our understanding of the general rules of how genetics work, we have been able to piece together a picture of how humans evolved. Although the hypotheses that are tested are much more complex than those concerning the presence of a particular tree in a specific location, they are all based on the scientific method. This is why the scientific method is humanity’s greatest achievement. Not only can it answer questions about our own history and about the location and adaptations of trees, but also about the history of the universe.

Compelling observation is key to good science, and we have got very good at making detailed and diverse observations. But it is not the whole story. All these data create challenges. Analysing them can require a lot of computational resources, and there is a harder question: how strong does a pattern have to be before we can be confident it has not simply arisen by chance, or before we decide it is something we should worry about?

Of my undergraduate students, 61.4 per cent do not like statistics. Convincing most people that statistics is fun, or useful, can be far from straightforward. But statistical analysis is one of the cornerstones of modern science. The aim of statistics is to provide some measure of confidence that a pattern in data is real.

Imagine you have a large bag of sweets. You dip your hand in and remove one sweet. It is red. Does that mean that all the sweets in the bag are red? Intuitively you know you cannot conclude that because you have only looked at – or sampled, as statisticians would say – one sweet out of perhaps several hundred. If you sampled two sweets from the bag and they were both red, you might start to be a little more confident that you have a bag containing only red sweets. Your confidence would grow that you had a bag of red sweets if you sampled fifty sweets and they were all red, but you would still not be 100 per cent confident that all sweets in the bag were red. You can only be 100 per cent confident if you have examined all the sweets and found that they were all red. What statistics does is provide a way to put a measure of confidence in any patterns you might find in data. In this case, your data are the colours of the sweets you have pulled from the bag, and the pattern is that they are all red. You want to be able to say something like ‘I am 85 per cent confident all the sweets are red.’ In the same way we are all scientists at heart, we are also all statisticians. We are endlessly making decisions based on our confidence that something will, or will not, happen. As I wrote this, I was gambling that someone would want to publish this book.
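To make the sweet-sampling intuition concrete, here is a minimal Python sketch. The bag size of 200, the sample sizes and the assumption that exactly one sweet is blue are my own illustrative choices, not figures from the text; the point is simply that even a large all-red sample remains consistent with a bag that is not entirely red.

```python
import random

def all_red_sample_rate(bag_size=200, n_non_red=1, sample_size=50, trials=20_000):
    """Estimate how often a random sample comes out all red when the bag
    actually contains `n_non_red` non-red sweets among `bag_size` in total."""
    bag = ["red"] * (bag_size - n_non_red) + ["blue"] * n_non_red
    hits = sum(
        all(colour == "red" for colour in random.sample(bag, sample_size))
        for _ in range(trials)
    )
    return hits / trials

# Even an all-red sample of fifty sweets is consistent with a bag hiding
# one blue sweet: the exact chance of missing it is (200 - 50) / 200 = 75%.
for k in (1, 2, 10, 50, 150):
    rate = all_red_sample_rate(sample_size=k)
    print(f"sample of {k:>3} sweets: all red in {rate:.0%} of trials")
```

Larger samples shrink, but never quite remove, the chance that an odd sweet has slipped past you – which is exactly why a statement of confidence, rather than certainty, is the honest conclusion.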

It was a biologist, Ronald Fisher, who invented modern statistics. He was born in England in 1890 and died in Australia in 1962, aged seventy-two. The list of statistical tools Fisher developed is remarkable, and anyone who has ever had to sit through lessons on statistics can thank Fisher because he invented things with names like ‘maximum likelihood’, ‘analysis of variance’ and the ‘F-test’. These are still taught in undergraduate science, medicine and social science courses.

Since Fisher laid the foundations of statistics, the field has advanced considerably. There is now a plethora of exotic-sounding statistical methods, including reversible-jump Markov chain Monte Carlo, trans-dimensional simulated annealing and hierarchical Bayesian multistate modelling. All these methods stem from Fisher’s fundamental insight that variation in the observations scientists make can be broken down into contributions from different sources. To illustrate this logic, imagine you measured the weight of everyone who lives in your neighbourhood. Different people will have different weights, which means there is variation in weight within the population you have collected the data from. Statistical methods allow you to explain where this variation comes from. Is it due to diet, where you were born, your genes, how much exercise you do, your gender, your height or your age? Fisher developed methods to estimate how factors such as these contribute to variation in data, and to assign a measure of confidence to the numbers produced.
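As a rough illustration of Fisher’s partitioning idea, here is a minimal Python sketch that splits variation in body weight into a part explained by exercise group and a part left over, in the spirit of his analysis of variance. The groups and the numbers are invented for illustration; they are not data from the book.

```python
# Made-up body weights (kg), grouped by how much exercise each person does.
groups = {
    "little exercise":  [82, 91, 78, 95, 88],
    "some exercise":    [74, 80, 77, 83, 79],
    "lots of exercise": [68, 72, 70, 75, 66],
}

all_weights = [w for ws in groups.values() for w in ws]
grand_mean = sum(all_weights) / len(all_weights)

# Total variation: squared deviations of every observation from the grand mean.
ss_total = sum((w - grand_mean) ** 2 for w in all_weights)

# Between-group variation: how far each group's mean sits from the grand mean.
ss_between = sum(
    len(ws) * (sum(ws) / len(ws) - grand_mean) ** 2 for ws in groups.values()
)

# Within-group variation: the scatter of individuals around their own group mean.
ss_within = ss_total - ss_between

print(f"total variation:        {ss_total:.1f}")
print(f"explained by exercise:  {ss_between:.1f} ({ss_between / ss_total:.0%})")
print(f"left unexplained:       {ss_within:.1f} ({ss_within / ss_total:.0%})")
```

The same bookkeeping, extended to many factors at once and paired with a test of how surprising the split would be by chance, is what sits underneath the methods taught in those undergraduate courses.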

Depending on how much data have been collected, the degree of confidence that is considered acceptable can vary. In physics, the level of confidence frequently used is called five sigma, which means there is roughly a one in three and a half million chance that the result arose by chance alone. In other fields where fewer data exist, such as palaeontology, a finding is typically deemed likely to be real, rather than due to luck, if the chance of it arising by chance alone is less than one in twenty.
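For readers who want to see where such numbers come from, here is a small Python sketch that converts a sigma level into the tail probability of a normal distribution, using the one-sided convention common in particle physics. The crude while-loop scan for the one-in-twenty threshold is my own shortcut, not a standard recipe.

```python
from math import erf, sqrt

def tail_probability(sigma: float) -> float:
    """One-sided tail of the normal distribution: the chance of a result
    at least `sigma` standard deviations above the mean by luck alone."""
    return 0.5 * (1 - erf(sigma / sqrt(2)))

p5 = tail_probability(5)  # the physicists' five-sigma rule
print(f"five sigma: p = {p5:.2e}, about 1 in {round(1 / p5):,}")

# Scan for the sigma level that corresponds to the one-in-twenty convention.
sigma = 0.0
while tail_probability(sigma) > 0.05:
    sigma += 0.001
print(f"one in twenty corresponds to roughly {sigma:.2f} sigma (one-sided)")
```

Running it shows five sigma sitting near one in three and a half million, while one in twenty corresponds to only about 1.6 sigma – a reminder that the two fields are demanding very different strengths of evidence.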

Statistics is a way of putting confidence in any patterns found in data, and statistical tests can be applied to both observational and experimental data. When a pattern is identified in observational data, the next step, if possible, is to construct an experiment to explore whether a process or mechanism hypothesized to generate the pattern is plausible. Revisiting my ‘why is that tree there?’ discussion from earlier in the chapter, I might plant seeds from a range of trees in a range of soil conditions and see what germinates. Good experiments have various treatments and a control. Different seed types, and different soil types, would be the treatments in this example, while the controls might be no seeds planted, or seeds without soil, or indeed no seeds or soil to explore whether the spontaneous emergence of life forms can occur. (It can’t.)
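Purely as an illustration of treatments and controls, here is a small Python sketch laying out such a germination experiment. The seed types, soil types and counts are invented; only the structure – several treatments plus controls – mirrors the design described above.

```python
# Hypothetical germination experiment: each entry maps a (seed, soil)
# combination to (seeds planted, seedlings that germinated).
experiment = {
    ("oak",   "chalky soil"): (50, 31),
    ("oak",   "clay soil"):   (50, 12),
    ("beech", "chalky soil"): (50, 8),
    ("beech", "clay soil"):   (50, 27),
    ("none",  "chalky soil"): (0, 0),   # control: soil but no seeds
    ("oak",   "no soil"):     (50, 0),  # control: seeds but no soil
}

for (seed, soil), (planted, germinated) in experiment.items():
    rate = germinated / planted if planted else 0.0
    print(f"{seed:>5} in {soil:<11}: {germinated:>2}/{planted} germinated ({rate:.0%})")
```

Comparing germination rates across the treatments, against the flat zeros of the controls, is the kind of pattern a statistical test would then assign a level of confidence to.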