To the considerable surprise of those of us who cut our teeth on Minitab and had to derive the OLS estimator by hand, data analytics is a sexy topic these days. Led by the likes of Nate Silver at FiveThirtyEight, Ezra Klein at Vox, and Nate Cohn at the New York Times’ Upshot, data nerds have made unprecedented forays into both the public sphere and the business world. At a time in which “truthiness” seems to have run amok, an increasingly widespread acceptance of the careful use of data in argumentation and analysis can only be welcome.
That’s the good news.
The bad news is that drawing meaningful inferences from data requires quite a bit more than simply looking at the data. It requires a rigorous understanding of the role of chance in producing outcomes, and it requires an understanding of how to bridge the gap between correlation and causation when working with observational data.
These are areas in which the social sciences can contribute greatly to data analytics. Social scientists are avid consumers and, increasingly, producers of statistical methodologies for deriving coherent conclusions from noisy data. And due to reasonable prohibitions on experimentation on human subjects, social scientists have long had to make do with observational data. In short, while many disciplines have encountered these issues, the social sciences have been plagued by them, and practitioners have developed a comparative advantage in dealing with them. For that reason, the social sciences are uniquely well-positioned to contribute to the further evolution of data analytics.
Three examples help to illustrate this point.
Vote fraud is a common problem in democratizing countries, but the efficacy of election monitors is difficult to gauge: because monitors tend to be sent to elections in which fraud is already a serious concern, the raw data could well show that elections with monitors are more corruption-prone than those without, even if the monitors are succeeding in reducing corruption. To deal with this selection problem, political scientist Susan Hyde took advantage of the fact that, in Armenia’s 2003 presidential election, monitors from the Organization for Security and Cooperation in Europe were assigned to polling stations effectively at random. By comparing results from polling stations with monitors to those without, Hyde demonstrates that candidates who engage in fraud receive a significantly lower share of the vote at monitored polling stations than at unmonitored ones.
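Because monitors were assigned effectively at random, the monitoring effect can be estimated with nothing fancier than a difference in means between the two groups of polling stations. The sketch below illustrates the logic on invented numbers; the vote shares, sample sizes, and effect size are assumptions for illustration, not Hyde’s actual data.

```python
import random

random.seed(42)

# Simulated vote shares for a fraud-committing candidate at 100 monitored
# and 100 unmonitored polling stations. Because assignment was (effectively)
# random, a simple difference in means estimates the effect of monitoring.
# All numbers here are invented for illustration.
monitored = [random.gauss(0.52, 0.05) for _ in range(100)]
unmonitored = [random.gauss(0.58, 0.05) for _ in range(100)]

def mean(xs):
    return sum(xs) / len(xs)

effect = mean(monitored) - mean(unmonitored)
print(f"Estimated effect of monitoring on vote share: {effect:+.3f}")
```

The key point is that random assignment is doing all the work: without it, the same subtraction would mix the monitoring effect with the selection effect of monitors being sent to the worst elections.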
Other examples concern political attitudes. Data analysts, especially in the media, often attribute fluctuations in public opinion to current events or recent statements by politicians. In contrast, a small but growing literature argues that political attitudes are remarkably stable over time, even across decades or centuries. Political scientists Avidit Acharya, Matthew Blackwell, and Maya Sen demonstrate that the prevalence of slavery in a county 150 years ago still shapes contemporary political attitudes, while economists Irena Grosfeld and Ekaterina Zhuravskaya demonstrate that the partitions of Poland in the late 18th century produced changes in political attitudes that persist to this day. Grosfeld and Zhuravskaya reached this conclusion by examining the spatial distribution of public opinion and discovering abrupt, significant discontinuities along the old lines of partition.
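The spatial-discontinuity logic can be sketched in a few lines: if an old border still matters, outcomes should jump abruptly as you cross it, over and above any smooth geographic trend. The toy simulation below invents a border location, a jump size, and noise purely for illustration; it is not the authors’ data or their estimator, just the intuition behind it.

```python
import random

random.seed(0)

# Toy spatial discontinuity: an outcome (say, a survey attitude) that trends
# smoothly with location but jumps at a historical border. Border position,
# jump size, and noise level are all invented for illustration.
BORDER = 0.0

def attitude(x):
    # Smooth geographic trend plus a discrete jump on one side of the border.
    return 0.3 * x + (0.25 if x > BORDER else 0.0) + random.gauss(0, 0.1)

positions = [random.uniform(-1, 1) for _ in range(2000)]
data = [(x, attitude(x)) for x in positions]

# Estimate the jump by comparing means within a narrow bandwidth of the
# border: any smooth trend is roughly constant over so short a distance.
h = 0.1
left = [y for x, y in data if -h < x <= BORDER]
right = [y for x, y in data if BORDER < x < h]
jump = sum(right) / len(right) - sum(left) / len(left)
print(f"Estimated discontinuity at the border: {jump:.3f}")
```

A gradual regional difference would not produce a jump like this; an abrupt discontinuity exactly at the old partition line is hard to explain by anything other than the partition itself.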
Even when data analysts do use sophisticated methods, they tend to see them as a collection of useful tools rather than as parts of a coherent body of knowledge. For that reason, they fail to realize that applying those tools without the necessary background can do more harm than good. A recent example from my daily commute was an episode of the Data Skeptic podcast on the subject of Bayesian A/B testing (or split-sample hypothesis testing, for those of us not in the business world). The podcast’s host expressed excitement at the possible applications of the test and asked if there were general principles guiding its use, to which the guest replied, “Test as much as possible”—apparently unaware that doing so is a recipe for false-positive results, as the science-savvy web cartoon xkcd once pointed out. A business that followed this advice would end up building its strategy around statistical anomalies and flukes rather than meaningful results.
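The problem with “test as much as possible” is easy to demonstrate by simulation: even when every variant is identical to the control, each test carries roughly a 5% chance of a spurious “significant” result, and those chances compound. The sketch below uses a plain two-sample z-test rather than the Bayesian machinery discussed on the podcast, and the sample sizes and number of tests are arbitrary choices for illustration.

```python
import random

random.seed(1)

def one_ab_test(n=200):
    """Return True if a test falsely declares a difference at p < 0.05.
    Both arms draw from the same distribution, so any 'significant'
    difference is a fluke."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = (2 / n) ** 0.5            # standard error of the difference (sd = 1)
    return abs(diff / se) > 1.96   # ~5% false-positive rate per test

# A business that runs 20 such tests and acts on any single "hit" will be
# fooled most of the time: 1 - 0.95**20 is about 64%.
trials = 500
hits = sum(any(one_ab_test() for _ in range(20)) for _ in range(trials))
print(f"Chance of at least one false positive in 20 tests: {hits / trials:.0%}")
```

This is exactly the scenario in the xkcd strip: test twenty colors of jelly bean at the 5% level and one of them will “cause acne” more often than not.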
I certainly don’t mean to overstate the ability of social scientists to figure out what makes the world go around using only observational data: there will always be caveats, and no methodology or research design is totally ironclad. But the more I observe the incredible proliferation of data analytics in both the business world and the public sphere, the more convinced I become that its main shortcomings are exactly those areas in which the social sciences excel.
Hyde, Susan (2011). The Pseudo-Democrat’s Dilemma: Why Election Monitoring Became an International Norm. Ithaca: Cornell University Press.
Acharya, Avidit, Matthew Blackwell, and Maya Sen (2014). “The Political Legacy of American Slavery.” Harvard Kennedy School Faculty Research Working Paper Series RWP14-057.
Grosfeld, Irena, and Ekaterina Zhuravskaya (2013). “Persistent Effects of Empires: Evidence from the Partitions of Poland.” CEPR Discussion Paper 9371.