Will Big Data Give Us a Whole Bunch of Questionable Correlations?
I think, statistically speaking, there’s no way that it won’t.
I’m listening to a recent episode of Seth Godin’s podcast, entitled Sample Size (http://aca.st/0d21e1).
Go listen to it, really.
What struck me from some of the examples that he gave, especially when talking about the football prediction sites, is that as we collect more and more data, we will, by sheer force of numbers, end up correlating different things that actually have nothing to do with one another.
To wit, Seth describes starting up 200 websites, with 100 predicting that Team A will win and 100 predicting Team B. Then, after Team B wins, you shut down all the sites that predicted Team A, and so on for weeks on end, until you get to the Super Bowl with one site left that has predicted every game correctly so far. People viewing that site would assume, incorrectly, that the person making these predictions must really know something. What that person actually knows is that if you run the predictions randomly enough times, over enough websites, one of them will likely end up predicting them all correctly. But it means nothing for the next game, because the winning streak was just random noise.
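Here is a rough sketch of that scheme in Python, just to show how mechanical it is. The function name `run_scheme`, the even split each week, and the coin-flip game results are my own assumptions for illustration, not details from the podcast.

```python
import random


def run_scheme(num_sites=200, seed=0):
    """Simulate the prediction-site scheme (hypothetical helper).

    Each week, half of the surviving sites pick Team A and the rest
    pick Team B. The game result is a coin flip, and every site that
    picked the loser is shut down.
    """
    rng = random.Random(seed)
    survivors = list(range(num_sites))
    weeks = 0
    while len(survivors) > 1:
        rng.shuffle(survivors)
        half = len(survivors) // 2
        picked_a = survivors[:half]   # these sites predict Team A
        picked_b = survivors[half:]   # these sites predict Team B
        winner = rng.choice("AB")     # assumption: the game is a pure coin flip
        survivors = picked_a if winner == "A" else picked_b
        weeks += 1
    return survivors[0], weeks


if __name__ == "__main__":
    site, weeks = run_scheme()
    print(f"Site {site} has a perfect record after {weeks} weeks.")
```

Starting from 200 sites, one "perfect" site is always left after seven or eight weeks, no matter who wins. No football knowledge goes in anywhere.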
Since I usually read stuff about Big Data and AI, this caught my attention immediately, because when we feed enough data into an algorithm, looking for correlations, it will find tons of them. In fact, we are quickly coming up on a time when finding correlations is easy (if we aren’t already there). It takes almost no skill. Anyone will be able to do it.
The smart people, however, will understand which ones matter, and which ones can be acted upon.
For example, if I own a business, and I have enough data, I may discover that over the last year, my online store sold more widgets during the week after a music festival in Terre Haute, Indiana. I might, therefore, be tempted to try to sponsor a new music festival in Indiana, to give me another week of increased sales, right? There’s a correlation!
But is the increase in sales being driven by that music festival? Is it just a random coincidence that our online sales went up slightly during those same couple of weeks?
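To get a feel for how often pure coincidence looks like a pattern, here is a small sketch in Python. It makes up 100 unrelated series of 52 weekly values (think one year of data each) and counts how many pairs happen to look correlated. Everything in it is an assumption for illustration: the function names, the random data, and the 0.3 cutoff for what counts as a "noticeable" correlation.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def count_spurious(num_series=100, length=52, threshold=0.3, seed=0):
    """Count pairs of independent random series with |r| >= threshold.

    Hypothetical helper: every series is independent noise, so any
    correlation it finds is a coincidence by construction.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(length)]
              for _ in range(num_series)]
    pairs = list(combinations(range(num_series), 2))
    strong = [(i, j) for i, j in pairs
              if abs(pearson(series[i], series[j])) >= threshold]
    return len(strong), len(pairs)


if __name__ == "__main__":
    strong, total = count_spurious()
    print(f"{strong} of {total} unrelated pairs look correlated.")
```

With 100 series there are 4,950 pairs to compare, so even a small chance of a fluke per pair adds up to a long list of "findings" that mean nothing.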
Now this may seem obvious, and it probably doesn’t take much insight to see that one is unlikely to be causing the other, but there are going to be, literally, hundreds of these kinds of correlations that data and AI are going to bring to light. The people who are able to understand which ones matter, and which ones are not relevant, will win in business. Those who can’t will end up making big mistakes chasing down correlations that are not relevant, just random.
Then, extrapolate that out into the public realm. As AI and big data shape more and more of our public policy decisions, will the people shaping those policies be smart enough to understand the correlations, and which ones matter? Will our decisions about policing, criminal sentencing, economics, even predicting things like natural disasters, climate change, or terrorism risks get skewed by random data points that look like true causation but are really just correlations of no consequence? And will we end up missing true, very real, risks as they get lost in the noise of hundreds, maybe thousands, of random correlations?
These are serious questions, and it’s going to take some really smart people to understand what the data spits out and act on it appropriately.
Do we have enough of them?