Believe in Science? Bad Big-Data Studies May Shake Your Faith

Gary Smith

27 Apr 2022, 01:08 AM IST

(Bloomberg Opinion) -- Coffee was wildly popular in Sweden in the 18th century, and also illegal. King Gustav III believed that it was a slow poison and devised a clever experiment to prove it. He commuted the sentences of murderous twin brothers who were waiting to be beheaded, on one condition: One brother had to drink three pots of coffee every day while the other drank three pots of tea. The early death of the coffee-drinker would prove that coffee was poison.

It turned out that the coffee-drinking twin outlived the tea drinker, but it wasn’t until the 1820s that Swedes were finally legally permitted to do what they had been doing all along — drink coffee, lots of coffee.

The cornerstone of the scientific revolution is the insistence that claims be tested with data, ideally in a randomized controlled trial. Gustav’s experiment was noteworthy for his use of identical male twins, which eliminated the confounding effects of sex, age and genes. The most glaring weakness was that nothing statistically persuasive can come from such a small sample.

Today, the problem is not the scarcity of data, but the opposite. We have too much data, and it is undermining the credibility of science.

Luck is inherent in random trials. In a medical study, some patients may be healthier. In an agricultural study, some soil may be more fertile. In an educational study, some students may be more motivated. Researchers consequently calculate the probability (the p-value) that the outcomes might happen by chance. A low p-value indicates that the results cannot easily be attributed to the luck of the draw.
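One way to make the "luck of the draw" concrete is a permutation test, sketched below in Python. This is an illustration, not a method described in this column: the function name `permutation_p_value` and the simulated "treated" and "control" groups are hypothetical, and both groups are deliberately drawn from the same distribution, so any gap between them is pure chance. The test shuffles the group labels many times and counts how often chance alone produces a gap at least as large as the one observed.

```python
import random


def permutation_p_value(treated, control, n_shuffles=5000, seed=0):
    """Estimate how often random relabeling gives a gap in means
    at least as large as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(sum(treated) / len(treated) - sum(control) / len(control))
    pooled = list(treated) + list(control)
    n_t = len(treated)
    n_c = len(pooled) - n_t
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign outcomes to groups at random
        diff = abs(sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c)
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles


# Hypothetical data: both groups come from the SAME distribution,
# so there is no real treatment effect to find.
rng = random.Random(1)
treated = [rng.gauss(0, 1) for _ in range(30)]
control = [rng.gauss(0, 1) for _ in range(30)]
print("estimated p-value:", permutation_p_value(treated, control))
```

A small result would mean that a gap this large rarely arises from shuffling alone. A large one would mean that luck is a perfectly adequate explanation.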

How low? In the 1920s, the great British statistician Ronald Fisher said that he considered p-values below 5% to be persuasive, and so 5% became the hurdle for the “statistically significant” certification needed for publication, funding and fame.

It is not a difficult hurdle. Suppose that a hapless researcher calculates the correlations among hundreds of variables, blissfully unaware that the data are all, in fact, random numbers. On average, one out of 20 correlations will be statistically significant, even though every correlation is nothing more than coincidence.
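The hapless researcher is easy to simulate. The Python sketch below is an illustration under stated assumptions, not an analysis from this column: it generates 60 variables of pure random noise (fewer than "hundreds", only to keep the run short), correlates every pair, and counts how many clear the 5% hurdle. The p-value uses the Fisher z-transform, a standard large-sample approximation, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (an approximation):
    atanh(r) * sqrt(n - 3) is roughly standard normal when the true
    correlation is zero."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_significant(n_vars=60, n_obs=100, alpha=0.05, seed=0):
    """Correlate every pair of pure-noise variables; count the 'discoveries'."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    hits = total = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            total += 1
            if approx_p_value(pearson_r(data[i], data[j]), n_obs) < alpha:
                hits += 1
    return hits, total


hits, total = count_significant()
print(f"{hits} of {total} correlations are 'significant' ({hits / total:.1%})")
```

Every variable here is noise by construction, so each "significant" pair the script reports is a false discovery, and the share should land near one in 20.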

Real researchers don’t correlate random numbers but, all too often, they correlate what are essentially randomly chosen variables. This haphazard search for statistical significance even has a name: data mining. As with random numbers, the correlation between randomly chosen, unrelated variables has a 5% chance of being fortuitously statistically significant. Data mining can be augmented by manipulating, pruning and otherwise torturing the data to get low p-values.

To find statistical significance, one need merely look sufficiently hard. Thus, the 5% hurdle has had the perverse effect of encouraging researchers to do more tests and report more meaningless results.
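The arithmetic behind "look sufficiently hard" is short. Assuming the tests are independent and no real effect exists anywhere (both simplifying assumptions), the chance that at least one of k tests clears the 5% hurdle is 1 - 0.95^k. The helper name below is hypothetical.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests of
    nonexistent effects comes out 'significant' at level alpha."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {chance_of_false_positive(k):.1%}")
```

Under these assumptions the chance passes 60% by 20 tests and is above 99% by 100.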

As a result, silly relationships are published in good journals simply because the results are statistically significant:

Students do better on a recall test if they study for the test after taking it (Journal of Personality and Social Psychology).
Japanese-Americans are prone to heart attacks on the fourth day of the month (British Medical Journal).
Bitcoin prices can be predicted from stock returns in the paperboard, containers and boxes industry (National Bureau of Economic Research).
Elderly Chinese women can postpone their deaths until after the celebration of the Harvest Moon Festival (Journal of the American Medical Association).
Women who eat breakfast cereal daily are more likely to have male babies (Proceedings of the Royal Society).
People can use power poses to increase their dominance hormone testosterone and reduce their stress hormone cortisol (Psychological Science).
Hurricanes are deadlier if they have female names (Proceedings of the National Academy of Sciences).
Investors can obtain a 23% annual return in the market by basing their buy/sell decisions on the number of Google searches for the word “debt” (Scientific Reports).

These now-discredited studies are the tip of a statistical iceberg that has come to be known as the replication crisis.

A team led by John Ioannidis looked at attempts to replicate 34 highly respected medical studies and found that only 20 were confirmed. The Reproducibility Project attempted to replicate 97 studies published in leading psychology journals and confirmed only 35. The Experimental Economics Replication Project attempted to replicate 18 experimental studies reported in leading economics journals and confirmed only 11.

I wrote a satirical paper that was intended to demonstrate the folly of data mining. I looked at Donald Trump’s voluminous tweets and found statistically significant correlations between: Trump tweeting the word “president” and the S&P 500 index two days later; Trump tweeting the word “ever” and the temperature in Moscow four days later; Trump tweeting the word “more” and the price of tea in China four days later; and Trump tweeting the word “democrat” and some random numbers I had generated.

I concluded — tongue as firmly in cheek as I could hold it — that I had found “compelling evidence of the value of using data-mining algorithms to discover statistically persuasive, heretofore unknown correlations that can be used to make trustworthy predictions.”

I naively assumed that readers would get the point of this nerd joke: Large data sets can easily be mined and tortured to identify patterns that are utterly useless. I submitted the paper to an academic journal and the reviewer’s comments demonstrate beautifully how deeply embedded is the notion that statistical significance supersedes common sense: “The paper is generally well written and structured. This is an interesting study and the authors have collected unique datasets using cutting-edge methodology.”

It is tempting to believe that more data means more knowledge. However, the explosion in the number of things that are measured and recorded has magnified beyond belief the number of coincidental patterns and bogus statistical relationships waiting to deceive us.

If the number of true relationships yet to be discovered is limited, while the number of coincidental patterns is growing exponentially with the accumulation of more and more data, then the probability that a randomly discovered pattern is real is inevitably approaching zero.
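That argument can be put in numbers with a back-of-the-envelope sketch. Every figure below is a hypothetical assumption chosen for illustration, not an estimate from any study: a fixed pool of 100 true relationships, an 80% chance that a test detects a true one, and a 5% chance that a coincidental pattern passes as significant. Only the number of coincidental candidates grows.

```python
def share_of_real_findings(n_true, n_spurious, power=0.8, alpha=0.05):
    """Among statistically significant results, the expected fraction
    that reflect a true relationship."""
    real_hits = power * n_true        # true relationships that are detected
    false_hits = alpha * n_spurious   # coincidences that pass the 5% hurdle
    return real_hits / (real_hits + false_hits)


# Hypothetical: the pool of true relationships stays at 100 while the
# number of coincidental candidates grows with the data.
for n_spurious in (1_000, 100_000, 10_000_000):
    share = share_of_real_findings(n_true=100, n_spurious=n_spurious)
    print(f"{n_spurious:>12,} coincidental candidates: {share:.4f} of findings are real")
```

Holding the true relationships fixed, the real share of significant findings shrinks toward zero as the pile of coincidental candidates grows.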

The problem today is not that we have too few data, but that we have too much data, which seduces researchers into ransacking it for patterns that are easy to find, likely to be coincidental, and unlikely to be useful.

This column does not necessarily reflect the opinion of the editorial board or Bloomberg LP and its owners.

Gary Smith, an economics professor at Pomona College, is the author of "The AI Delusion" and the forthcoming "Distrust: Big Data, Data-Torturing, and the Assault on Science."