Paul McLellan

A Big Problem with Big Data

21 Jan 2020 • 5 minute read

I happened to read a blog post that referred to a 2018 paper in The Annals of Applied Statistics with the title Statistical Paradises and Paradoxes in Big Data: Law of Large Populations, Big Data Paradox, and the 2016 Presidential Election. The 2016 election was used because there is a lot of publicly accessible data ("big data") about it; this is not any sort of political analysis. It does, however, point to why the election result was a surprise, in the sense that it differed from almost all the polls taken before election day.

Since we are in the era of big data, and deep learning from it, there is a sort of unstated assumption that big data leads to more accurate answers. But the caveat of the paper is:

compensating for quality with quantity is a doomed game.

Soup

Here's what seems like a paradox. I think I've given the soup analogy before in Breakfast Bytes, but I'll repeat it. Xiao-Li Meng, the paper's author, repeats it in his paper too—I guess we all learn this analogy in undergraduate statistics. Suppose you have a well-stirred pot of soup and you want to taste it to see if it needs more salt. Can you just use a teaspoon? Or do you need to sample 80% of the soup? Or take many different samples? Obviously, you can just use a single teaspoon. Importantly, the answer doesn't depend on the size of the pot either. Of course, for most human (and many other) populations, uniform stirring is not feasible, but probabilistic sampling does the same trick. How big a sample do you need? Just as with the spoon, if the sampling is uniform enough, the sample size you need doesn't depend on the size of the pot or of the population. The size of the population you are sampling from doesn't appear in the equation for the size of the sample you need.
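
To make that concrete, here is a minimal sketch (in Python; my own illustration, not from the paper) of the textbook margin-of-error calculation for a simple random sample, including the finite-population correction. The population size N barely matters once it is much larger than the sample size n:

```python
import math

def srs_margin_of_error(n, N, p=0.5, z=1.96):
    """95% margin of error for a proportion p estimated from a simple
    random sample of size n drawn from a population of size N."""
    fpc = math.sqrt((N - n) / (N - 1))   # finite-population correction
    return z * math.sqrt(p * (1 - p) / n) * fpc

# A 1,000-person sample gives essentially the same margin of error
# whether the "pot" holds a hundred thousand people or 330 million.
for N in (100_000, 1_000_000, 330_000_000):
    print(f"N = {N:>11,}: +/- {srs_margin_of_error(1_000, N):.2%}")
```

The three margins of error come out within a few hundredths of a percentage point of each other, at roughly 3.1%.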

But the message from the paper is that when there is a bit of bias, even a tiny bit, then the size of the population does feed into the size of the sample you need, and in a big way. That is why compensating for quality with quantity doesn't work.

Xiao-Li started the project when an agency asked him to help with statistical quality control and put the question, "Which should we trust more, a 5% survey sample or an 80% administrative dataset?" Before the era of big data, this would not really have been a question to consider, since you could rarely get hold of 80% of a population. The whole idea of survey sampling is to learn about a population without having to record a large chunk of it.

In the era of big data, we have lots of datasets that cover large percentages of their populations, but the data were not collected with the intention of being probabilistic samples. To use them, we need to know how much they can help compared with doing a much smaller probabilistic sample. This is the topic of the paper.

An Example

Here is an astounding example from the 2016 election that appears towards the end of the paper. Suppose we sample 1% of the population, 2.3 million people, with a tiny bias: what sample size with zero bias is that equivalent to? To make it concrete, assume almost ideal conditions: nobody sampled lies, and nobody changes their mind between being sampled and the election. To add a little bias, assume Trump voters are 0.1% more likely to decline to answer. That is, out of every thousand Trump voters, one more declines to answer than among non-Trump voters. What sample size would we need (to get the same expected error in our poll) without that small Trump-voter bias? The answer turns out to be 400. That is to say, sampling 2.3 million people with that tiny bias is equivalent to sampling just 400 people with no bias. Earlier in the paper, a similar calculation is done with Trump voters 0.5% more likely to decline but with half the population sampled, namely 115M people, and the equivalent is the same 400.
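
As a rough illustration of where a number like 400 comes from, here is a small sketch based on the paper's central idea: the error of a biased sample is governed by a "data defect correlation" rho between who ends up in the data and how they vote, and the equivalent (effective) simple-random-sample size shrinks roughly like f / ((1 - f) * rho^2), where f is the fraction of the population captured. The rho values below are assumptions on my part, chosen so that the formula reproduces the roughly-400 equivalents quoted above; the paper derives them from the stated response-rate differences, a step I am not reproducing here:

```python
def effective_sample_size(f, rho, N):
    """Effective simple-random-sample size for a biased sample covering a
    fraction f of a population of size N, with data defect correlation rho.
    Obtained by matching the biased mean-squared error, rho^2 * (1-f)/f * sigma^2,
    against the SRS variance, (1/n_eff - 1/N) * sigma^2."""
    return 1.0 / (rho**2 * (1 - f) / f + 1.0 / N)

N = 230_000_000  # rough number of eligible US voters (my assumption)
for f, rho in [(0.01, 0.005), (0.50, 0.05)]:
    n_sampled = f * N
    n_eff = effective_sample_size(f, rho, N)
    print(f"sampled {n_sampled:>13,.0f} (f = {f:.0%}, rho = {rho}): "
          f"equivalent to about {n_eff:,.0f} unbiased responses")
```

Both scenarios print an effective sample size of about 400, despite one of them covering half the population.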

As Xiao-Li says in his paper:

Such dramatic reductions appear to be too extreme to be believable, a common reaction from the audiences whenever this result is presented. It is indeed extreme, but what should be unbelievable is the magical power of probabilistic sampling, which we all have taken for granted for too long.

The problem is that when there is bias in the sample, it is no longer the case that the size of the population doesn't appear in the sample size we need. In the big data era, the population being sampled is often very large, and so the sample required is enormous. With significant bias, it approaches 100% of the population just to get something equivalent to a random sample of fewer than 1000 people.

Equivalently, given the practical size of samples, the errors are much bigger than traditional statistical measures imply. The traditional approach uses a formula that gives an error of 0.06%. The approach in this paper gives a margin of error of about 5%, roughly 83 times as large, leading to gross overconfidence in what the data tell us. As a further confirmation, when each state is considered separately, the errors in the predictions for Trump voters are larger in the larger states.
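
Those two margins of error are easy to sanity-check with the textbook formula (a back-of-the-envelope calculation, assuming a 95% confidence level and a proportion near 50%; the paper's exact numbers differ slightly). The nominal n of 2.3 million gives about 0.06%, while the effective sample size of roughly 400 gives about 5%:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Textbook 95% margin of error for a proportion, ignoring the
    finite-population correction."""
    return z * math.sqrt(p * (1 - p) / n)

print(f"nominal n = 2,300,000: +/- {margin_of_error(2_300_000):.3%}")  # ~0.065%
print(f"effective n = 400    : +/- {margin_of_error(400):.3%}")        # ~4.9%
```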

A graph in the paper shows the near impossibility of achieving large effective sample sizes once bias is present, where the effective sample size is the size of a simple random sample (SRS) that would give the same error.

The Takeaway

This should remind us that, without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.

It is essentially wishful thinking to rely on the “bigness” of Big Data to protect us from its questionable quality, especially for large populations.

Or, more technically:

Under a probabilistic sampling, a central driving force for the stochastic behaviors of the sample mean (and alike) is the sample size n. This is the case for the Law of Large Numbers (LLN) and for the Central Limit Theorem (CLT), two pillars of theoretical statistics, and of much of applied statistics because we rely on LLN and CLT to build intuitions, heuristic arguments, or even models. However, [the equation] implies that once we lose control of probabilistic sampling, then the driving force behind the estimation error is no longer the sample size n, but rather the population size N.
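
For reference, the equation the quote refers to is, as I read the paper, its central identity, which factors the error of the sample mean into a data-quality term, a data-quantity term, and a problem-difficulty term (treat the notation here as my reconstruction rather than a verbatim citation):

$$
\bar{Y}_n - \bar{Y}_N \;=\; \underbrace{\rho_{R,Y}}_{\text{data quality}} \times \underbrace{\sqrt{\frac{1-f}{f}}}_{\text{data quantity}} \times \underbrace{\sigma_Y}_{\text{problem difficulty}}, \qquad f = \frac{n}{N}
$$

Here $\rho_{R,Y}$ is the data defect correlation between being recorded in the dataset and the value being measured, and $\sigma_Y$ is the population standard deviation. Under genuine probabilistic sampling, $\rho_{R,Y}$ shrinks like $1/\sqrt{N}$ and the familiar $1/\sqrt{n}$ behavior is recovered; once sampling is no longer probabilistic, $\rho_{R,Y}$ stays roughly constant, the error is driven by how much of the population you failed to capture, and N rather than n takes over.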

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.