Big Data: Toward A Richer Social Science

by Alex Pentland, MIT and author, Social Physics (Penguin Press) http://hd.media.mit.edu

ABIDJAN, COTE D’IVOIRE.   Arnaud, a sergeant in the city police force, was worried about rising ethnic tensions. With a few keystrokes on his computer, he mapped the boundaries between the feuding ethnic groups and altered the daily patrol schedule to focus more on those areas. Sandrine, who managed the local public health systems, was working to contain a recent outbreak of an infectious disease in one of the city’s shantytowns. With a few keystrokes she pulled up a map of daily mobility patterns in and out of the shantytown, and then alerted the clinics where the disease was most likely to appear next. Moussa, an administrator for the city bus system, was concerned about rising overcrowding along certain routes. So he called up a map of the aggregate home-work transit patterns and discovered that they had changed significantly in the last month. Using this updated information, he began to change the bus routes to better handle the daily commute.

Although this may read like science fiction, each of these examples is real except for the names. They are examples of big data being applied to real and current social problems in Ivory Coast, and were part of the recent Data for Development initiative led by Nicolas de Cordes of Orange, Vincent Blondel of Louvain University, me at M.I.T., and the government of Cote d’Ivoire ( http://www.d4d.orange.com/en/home ). Ninety research groups from all around the world participated in this big data initiative and showed that, using only aggregated anonymous data, researchers can produce many important social insights.

TOWARD A RICHER SOCIAL SCIENCE

Most current social science is based on the analysis of either laboratory experiments or surveys. As a consequence of the ‘hand labor’ required, experiments are expensive and time consuming to conduct. More recently, techniques such as ‘experience sampling’ using text messages on mobile phones, and on-line web experiments, have reduced the cost of collecting data and made it possible to collect data more frequently. However, such data are typically still subjective, limited to a small number of questions, and suffer either from having only a limited, carefully controlled range of conditions or from having very sparse information about context. Even these new techniques are still far from the ideal of continuous, objective observation in natural conditions.

These limitations have practical consequences. While social science has discovered many important phenomena, it is rare to fully understand how separate phenomena interact or in what contexts they operate. As a consequence, it is generally impossible to predict the strength of a particular effect in a particular circumstance. This leaves social science more a qualitative than a quantitative science.

As an alternative, let us instead imagine the ability to place an fMRI-like imaging chamber around an entire community, and then the ability to record and display every facet and dimension of behavior, communication, and social interaction among its members. Now, think about doing this for up to several years, while the members of the community go about their everyday lives; this is an idealization of what is known as a living lab.

While it is not (yet) possible to achieve this level of detailed observation, it is possible to come surprisingly close through the use of ‘big data.’ Big data is the newly ubiquitous digital data now available about all aspects of human life, mainly the digital breadcrumbs that we all leave behind us as we move through the world: mobile phone call records, credit card transactions, and GPS location fixes, among other types of data. These data tell the story of our lives by recording what we have chosen to do. They are very different from what you can find on Facebook; postings on Facebook are what people choose to tell each other, presented according to the standards of the day. Who we actually are is more accurately determined by where we spend our time, who we talk to, and which things we buy, and not just by what we say we do.

During the last decade my research group developed the ability to build and deploy such living labs, measuring entire social organisms (groups, companies, and whole communities) on a millisecond-by-millisecond basis for up to years at a time. The method is simple: measurements are made by collecting digital breadcrumbs such as sensor readings from cell phones, postings on social media, purchases with credit cards, and more. By combining these fine-grained, objective measurements with traditional surveys and other social science tools, we can add both richer context and greater detail to the traditional measurement techniques. In many cases this allows us to predict effect sizes accurately even in complex, natural circumstances. We can also estimate subjective biases, and compare socially-constructed reality to objective reality.

Using this methodology we have found, for instance, that there are behavioral markers that signal the impending onset of flu, a fact that is now being used commercially by Ginger.io.  We also found that certain aggregate changes in behavior, for instance, elders abandoning a favorite town square, predict future high crime rates. And we have discovered that changes in whom you associate with accurately predict changes in voting behavior. But what is perhaps just as remarkable is that these living lab studies took only days from conception to launch, and their cost was essentially zero, as is explained in the privacy and data ownership section of this note.

The living lab tools we have constructed are now being used by many researchers. What they find when they begin using such rich, continuous data is that the scientific method as normally practiced no longer functions well, because there are so many potential connections that our standard statistical tools generate nonsense results. Almost everything is significant with p ≈ 0, and so any analysis must instead rely on how much variance is accounted for as a function of contextual variables (which everyone should have been doing all along, but that is another argument).
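
The point about significance can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not an analysis from any living lab: the sample size, the effect size of 0.05, and the helper names `pearson_r` and `two_sided_p` are all assumptions chosen for illustration, and the p-value uses a large-sample normal approximation.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def two_sided_p(r, n):
    """Two-sided p-value for the null hypothesis of zero correlation.

    Uses the normal approximation to the t statistic, which is
    reasonable only for large n (an assumption of this sketch).
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))


random.seed(0)
n = 200_000  # hypothetical number of observations

# Synthetic data: y depends on x only very weakly (slope 0.05).
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [0.05 * xi + random.gauss(0.0, 1.0) for xi in x]

r = pearson_r(x, y)
p = two_sided_p(r, n)
r_squared = r * r  # fraction of variance in y accounted for by x

print(f"r = {r:.4f}, p = {p:.3g}, variance explained = {r_squared:.4%}")
```

With this many observations the p-value is vanishingly small, yet x accounts for well under one percent of the variance in y. The significance test says only that the effect is not exactly zero; the variance explained says how much it matters.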

More seriously, when an experiment measures a large number of contextual variables it is typical to discover that many variables have significant predictive power, and so it is critical to ask which variable or combination of variables is causal. As a consequence it becomes difficult to form a limited, testable number of clear hypotheses for A-B testing. Instead, with such rich data it is important to use new, more sophisticated mathematical tools to test the causality of connections in the real world. Researchers can no longer rely on simplified laboratory environments; instead they need to actually do the experiments in the real world, and usually on massive, high-dimensional streams of data such as those produced by living laboratories, since these sorts of rich data are required to quantitatively characterize the complex interactions typical of human behavior.
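
The gap between predictive power and causality can be sketched with synthetic data. The example below is an illustration under stated assumptions, not one of the mathematical tools referred to above: a hidden contextual variable `z` (hypothetical) drives both `x` and `y`, so `x` predicts `y` in observational data even though it has no causal effect, and the association disappears once `x` is assigned at random, as in an experiment.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)
n = 50_000  # hypothetical number of observations

# Hidden context z influences both the candidate variable and the outcome.
z = [random.gauss(0.0, 1.0) for _ in range(n)]

# Observational setting: x merely tracks z and has no effect on y.
x_observed = [zi + random.gauss(0.0, 1.0) for zi in z]
y_observed = [zi + random.gauss(0.0, 1.0) for zi in z]
r_observed = pearson_r(x_observed, y_observed)

# Experimental setting: x is assigned at random, independent of z.
x_assigned = [random.gauss(0.0, 1.0) for _ in range(n)]
y_experiment = [zi + random.gauss(0.0, 1.0) for zi in z]
r_experiment = pearson_r(x_assigned, y_experiment)

print(f"observational r = {r_observed:.3f}")
print(f"experimental  r = {r_experiment:.3f}")
```

In the observational data x looks like a strong predictor of y; under random assignment the correlation is close to zero, which is why the experiment has to be run rather than inferred from prediction alone.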

An example of such a living lab is the ‘open data city’ I helped launch last year with the city of Trento in Italy, along with Telecom Italia, Telefónica, the research university Fondazione Bruno Kessler, the Institute for Data Driven Design, and local companies. This living lab, like earlier ones my research group conducted on the MIT campus, has the approval and informed consent of all of its participants—they know that they are part of a large experiment whose goal is to better understand human behavior and to use that knowledge to invent a better way of living. More detail on these living labs can be found at http://realitycommons.media.mit.edu and http://www.mobileterritoriallab.eu/.

PRIVACY AND DATA OWNERSHIP

To build living labs that produce these sorts of dense, continuous measurements, new legal and software tools had to be developed in order to protect the rights and privacy of the people in these labs, to ensure that they are fully informed about what is happening to their data, and that they maintain the right to opt out at any time. These ‘big data’ solutions, originally developed for human subjects research, have played an important role as examples in government policy debates over personal privacy, and have helped to shape both the US Consumer Privacy Bill of Rights and the EU Data Protection acts.

The approach taken is to give participants direct legal control over sharing personal data not only with researchers but also among themselves and with commercial and civic entities. The participants ‘own’ the data being collected, in the sense that they have legal rights of ownership, and can control where it goes. This has an interesting consequence for social and medical science: since dense data is being continuously collected into the users’ personal data stores, conducting a new experiment simply requires obtaining informed consent from the participants. The time and cost of recruiting new subjects and making new measurements are zero since the participants and their data already exist, and as a consequence both the cost and time required to conduct a new experiment are cut dramatically.
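
The consent-gated access described above can be sketched in a few lines. This is a hypothetical illustration only, not the software used in the living labs: the class `PersonalDataStore`, the function `collect_for_study`, the study identifier, and the sample records are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalDataStore:
    """Hypothetical per-participant store; the owner controls who may read it."""
    owner: str
    records: list = field(default_factory=list)
    consents: set = field(default_factory=set)  # study ids the owner opted in to

    def grant(self, study_id):
        """Record the owner's informed consent for one study."""
        self.consents.add(study_id)

    def revoke(self, study_id):
        """Opt out of a study; allowed at any time."""
        self.consents.discard(study_id)

    def read(self, study_id):
        """Return a copy of the records, but only for a consented study."""
        if study_id not in self.consents:
            raise PermissionError(f"{self.owner} has not consented to {study_id}")
        return list(self.records)


def collect_for_study(stores, study_id):
    """Gather existing data from every participant who consented to the study."""
    return {s.owner: s.read(study_id) for s in stores if study_id in s.consents}


# Data already sits in each participant's store before any study begins.
alice = PersonalDataStore("alice", records=[{"steps": 4200}])
bob = PersonalDataStore("bob", records=[{"steps": 9100}])

# Starting a new study only requires consent, not new measurements.
alice.grant("sleep-study")
data = collect_for_study([alice, bob], "sleep-study")

# Opting out removes the participant from later collection.
alice.revoke("sleep-study")
data_after_opt_out = collect_for_study([alice, bob], "sleep-study")
```

In this sketch the first collection returns only Alice's records, because Bob never consented, and the second returns nothing once Alice opts out.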

When Dutch lens makers created the first practical lenses, and thus allowed researchers to build the first microscopes and telescopes, they opened up broad new scientific vistas. Today, the new technology of living labs, which observe human behavior by collecting a community’s digital breadcrumbs, is beginning to give researchers a more complete view of life in all its complexity, and is, I believe, the future of social science.