As increasing amounts of digital data are produced and stored online, it is important to remember that humans produce much of that data. In an era in which people express themselves on Facebook, follow companies on Twitter, and allow their phones’ GPS to track their movements and their online retailers to track their buying decisions, social scientists have a tremendous opportunity to help shape the analysis of large-scale data sources to understand human attitudes and behavior.
A particular area of strength among social scientists concerns measurement. In contrast to researchers in many other disciplines, social scientists are frequently concerned with a latent variable that is not easily observed, and must estimate it using some other, more easily observed quantity. For instance, we may be interested in studying how an individual’s political ideology affects their political behavior, such as which candidates they support. It is quite difficult to observe something as nebulous as ideology directly, but social scientists have developed computational methods that we can apply to data to generate an estimate, including simple measures from surveys or behavioral measures such as voting records for members of Congress. My own work has shown that digital traces in social media, such as “likes” for a politician on Facebook, can also be used to measure ideology at a very large scale.
A second area of strength among social scientists concerns the development of causal theories and tests of causal inference. While causal inference is not unique to the social sciences, the problems inherent to developing causal models when considering people’s behavior are often different from those in other domains. For instance, humans select their environments, which makes causal inference difficult: people who choose a given environment may already differ from those who do not, so differences in their behavior cannot simply be attributed to the environment itself. To deal with this, social scientists develop theories that rely on assumptions about the world, along with a wide range of methodological tools, to make causal inference more tractable. In the case of big data, social scientists should advocate for well-defined theories of human behavior and for making the assumptions underlying causal tests clear. If we fail to do so, we are likely to understand what the world looks like without having a clear understanding of why it came to be so.
Large-scale data sources also create opportunities for social scientists to conduct research at a scale that was previously not feasible. The digital traces that humans leave behind through their interactions with computers, phones, smart watches, and other digital tools create enormous quantities of data that previously would have been cost-prohibitive or impossible to collect. Further, with more people conducting more of their daily lives online, it is possible for social scientific studies to include millions of individuals at once. Through the use of large-scale sources, social scientists are able to study more subtle causal effects through increased statistical power and also to characterize the behavior of ever-larger proportions of the population, thereby using big data both to “zoom in” on small changes and to “zoom out” to examine the effects these small changes have at a societal level.
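The link between sample size and the ability to detect subtle effects can be made concrete with a short calculation. The sketch below is only an illustration under simplifying assumptions: a comparison of means between two equally sized groups, a known and equal standard deviation, and a normal approximation. The function name and its parameters are hypothetical and chosen for this example.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, sd=1.0, alpha=0.05, power=0.80):
    """Smallest true difference in means that a two-sided test would detect
    with the requested power (normal approximation, equal group sizes).

    All parameter names are illustrative assumptions, not part of any
    particular study design.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = standard_normal.inv_cdf(power)          # quantile for the desired power
    standard_error = sd * sqrt(2 / n_per_group)       # SE of a difference in means
    return (z_alpha + z_power) * standard_error


# Compare a typical survey-sized study with a study of millions of people.
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: detectable effect = "
          f"{minimum_detectable_effect(n):.4f} standard deviations")
```

Under these assumptions, a study with 100 people per group can reliably detect only effects of roughly 0.4 standard deviations, while a study with one million people per group can detect effects of roughly 0.004 standard deviations. The detectable effect shrinks with the square root of the sample size, which is why very large samples make it possible to “zoom in” on small changes.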
My particular area of expertise, the study of social networks, has benefited greatly from big data. Social network analysis requires data that traditionally would have been difficult to collect and analyze because of their complexity. Big data and computational tools, however, have largely changed both of these processes. While we have always lived in a network, the ties between individuals have now become more explicit and are more easily tracked and quantified through online interaction, particularly social media. Each friend request we accept, comment we make, Twitter account we follow, or Snapchat we send potentially provides researchers with important information about the social environment we are in. Further, computational tools have advanced such that describing and analyzing a network of millions of individuals is a tractable problem. Not many years ago, both collecting and analyzing such data would have been impossible.
As our world becomes more computational, and as that change ushers in vast troves of new data about humans, it is critical that social scientists influence how these data are analyzed and the conclusions that are subsequently drawn. Such data offer abundant opportunities to study phenomena of interest at a new scale and with increased precision. However, doing so will require careful thought about the processes that have created these data: not only the mechanical processes that translate data from a server to a monitor screen, but also the processes through which humans create such data in the first place. If the methods and models we use to understand the data created by humans fail to account for how and why such data were created, we are unlikely to fully appreciate what this kind of data can tell us about human nature.