Statistics and big data
By David J. Hand
Nowadays it appears impossible to open a newspaper or switch on the television without hearing about “big data”. Big data, it sometimes seems, will provide answers to all the world’s problems. Management consulting company McKinsey, for example, promises “a tremendous wave of innovation, productivity, and growth … all driven by big data”.
An alien observer visiting the Earth might think it represents a major scientific breakthrough. Google Trends shows references to the phrase bobbing along at about one per week until 2011, at which point there began a dramatic, steep, and almost linear increase in references to the phrase. It’s as if no one had thought of it until 2011. Which is odd because data mining, the technology of extracting valuable, useful, or interesting information from large data sets, has been around for some 20 years. And statistics, which lies at the heart of all of this, has been around as a formal discipline for a century or more.
Or perhaps it’s not so odd. If you look back to the beginning of data mining, you find a very similar media enthusiasm for the advances it was going to bring, the breakthroughs in understanding, the sudden discoveries, the deep insights. In fact it almost looks as if we have been here before. All of this leads one to suspect that there’s less to the big data enthusiasm than meets the eye. That it’s not so much a sudden change in our technical abilities as a sudden media recognition of what data scientists, and especially statisticians, are capable.
Of course, I’m not saying that the increasing size of data sets does not lead to promising new opportunities – though I would question whether it’s the “large” that really matters as much as the novelty of the data sets. The tremendous economic impact of GPS data (estimated to be $150-270bn per year), retail transaction data, or genomic and bioinformatics data arise not from the size of these data sets, but from the fact that they provide new kinds of information. And while it’s true that a massive mountain of data needed to be explored to detect the Higgs boson, the core aspect was the nature of the data rather than its amount.
Moreover, if I’m honest, I also have to admit that it’s not solely statistics which leads to the extraction of value from these massive data sets. Often it’s a combination of statistical inferential methods (e.g. determining an accurate geographical location from satellite signals) along with data manipulation algorithms for search, matching, sorting and so on. How these two aspects are balanced depends on the particular application. Locating a shop which stocks that out of print book is less of an inferential statistical problem and more of a search issue. Determining the riskiness of a company seeking a loan owes little to search but much to statistics.
Some time after the phrase “data mining” hit the media, it suffered a backlash. Predictably enough, much of this was based around privacy concerns. A paradigmatic illustration was the Total Information Awareness project in the United States. Its basic aim was to search for suspicious behaviour patterns within vast amounts of personal data, to identify individuals likely to commit crimes, especially terrorist offences. It included data on web browsing, credit card transactions, driving licences, court records, passport details, and so on. After concerns were raised, it was suspended in 2003 (though it is claimed that the software continued to be used by various agencies). As will be evident from recent events, concerns about the security agencies monitoring of the public continues.
The key question is whether proponents of the huge potential of big data and its allied notion of open data are learning from the past. Recent media concern in the UK about merging of family doctor records with hospital records, leading to a six-month delay in the launch of the project, illustrates the danger. Properly informed debate about the promise and the risks is vital.
Technology is amoral — neither intrinsically moral nor immoral. Morality lies in the hands of those who wield it. This is as true of big data technology as it is of nuclear technology and biotechnology. It is abundantly clear — if only from the examples we have already seen — that massive data sets do hold substantial promise for enhancing the well-being of mankind, but we must be aware of the risks. A suitable balance must be struck.
It’s also important to note that the mere existence of huge data files is of itself of no benefit to anyone. For these data sets to be beneficial, it’s necessary to be able to use the data to build models, to estimate effect sizes, to determine if an observed effect should be regarded as mere chance variation, to be sure it’s not a data quality issue, and so on. That is, statistical skills are critical to making use of the big data resources. In just the same way that vast underground oil reserves were useless without the technology to turn them into motive power, so the vast collections of data are useless without the technology to analyse them. Or, as I sometimes put it, people don’t want data, what they want are answers. And statistics provides the tools for finding those answers.
The Very Short Introductions (VSI) series combines a small format with authoritative analysis and big ideas for hundreds of topic areas. Written by our expert authors, these books can change the way you think about the things that interest you and are the perfect introduction to subjects you previously knew nothing about. Grow your knowledge with OUPblog and the VSI series every Friday and like Very Short Introductions on Facebook. Subscribe to on Very Short Introductions articles on the OUPblog via email or RSS.
Subscribe to the OUPblog via email or RSS.
Subscribe to only mathematics articles on the OUPblog via email or RSS
Image credit: Diagram of Total Information Awareness system designed by the Information Awareness Office. Public domain via Wikimedia Commons