Abstract
Big data is all the rage; using large data sets promises new insights into questions that have been difficult or impossible to answer in the past. This is especially true in fields such as medicine and the social sciences, where large amounts of data can be gathered and mined to find insightful relationships among variables. Data in such fields involves humans, however, and thus raises privacy issues not faced by fields such as physics or astronomy. These issues become more pronounced when researchers try to share their data with others. Data sharing is a core feature of big-data science: it allows others to verify research that has been done and to pursue lines of inquiry that the original researchers may not have attempted. But sharing data about human subjects triggers a number of regulatory regimes designed to protect the privacy of those subjects. Sharing medical data, for example, requires adherence to HIPAA (Health Insurance Portability and Accountability Act); sharing educational data triggers the requirements of FERPA (Family Educational Rights and Privacy Act). To share data generally, these laws require that the data be de-identified or anonymized (for the purposes of this article, the terms are interchangeable). While FERPA and HIPAA define de-identification slightly differently, the core idea is the same: if certain values are removed from a data set, the individuals whose data is in the set cannot be identified, and their privacy is preserved. Previous research has examined how well these requirements protect the identities of those whose data is in a data set.2 Privacy violations such as re-identification generally work by linking records in a de-identified data set with outside data sources, and it is often surprising how little information is needed to re-identify a subject. More recent research has revealed a different, and perhaps more troubling, aspect of de-identification.
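As a minimal sketch of the linkage attack described above (every name, record, and field here is invented for illustration), an attacker joins a "de-identified" data set against a public data source on shared quasi-identifiers such as ZIP code, birth date, and sex:

```python
# De-identified medical records: names removed, quasi-identifiers kept.
deidentified = [
    {"zip": "02138", "birth": "1975-03-01", "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "birth": "1980-07-12", "sex": "M", "diagnosis": "asthma"},
]

# Outside data source (e.g., a public voter roll) that still carries names.
voter_roll = [
    {"name": "Alice Smith", "zip": "02138", "birth": "1975-03-01", "sex": "F"},
    {"name": "Bob Jones",   "zip": "02139", "birth": "1980-07-12", "sex": "M"},
]

def reidentify(deid_rows, public_rows):
    """Link the two data sets on their shared quasi-identifiers."""
    key = lambda r: (r["zip"], r["birth"], r["sex"])
    names_by_key = {key(r): r["name"] for r in public_rows}
    return {names_by_key[key(r)]: r["diagnosis"]
            for r in deid_rows if key(r) in names_by_key}

print(reidentify(deidentified, voter_roll))
# → {'Alice Smith': 'flu', 'Bob Jones': 'asthma'}
```

Only three fields are needed to make every "anonymous" record nameable again, which is the sense in which surprisingly little outside information suffices.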
These studies have shown that the conclusions one can draw from a de-identified data set differ significantly from those that would be drawn from the original data set.1 Indeed, it appears that the process of de-identification makes it difficult or impossible to use a de-identified (and therefore easily sharable) version of a data set either to verify conclusions drawn from the original or to do new science that is meaningful. This would seem to put big-data social science in the uncomfortable position of having either to reject notions of privacy or to accept that data cannot be easily shared, neither of which is a tenable position. This article looks at a particular data set, generated by the MOOCs (massive open online courses) offered through the edX platform by Harvard University and the Massachusetts Institute of Technology during the first year of those offerings. It examines which aspects of the de-identification process caused that data set to change significantly, and it presents a different approach to de-identification that shows promise for allowing both sharing and privacy.
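A toy example (all values invented, and not the article's actual method or data) suggests why de-identification can shift conclusions: suppressing rows whose quasi-identifier value is rare, a crude form of k-anonymity with k = 2, also changes the statistics computed from what remains:

```python
from collections import Counter

# Hypothetical course records; "country" is the quasi-identifier.
records = [
    {"country": "US", "completed": 1},
    {"country": "US", "completed": 0},
    {"country": "US", "completed": 0},
    {"country": "FR", "completed": 1},  # unique value → suppressed below
]

def suppress_rare(rows, k=2):
    """Drop rows whose quasi-identifier appears fewer than k times."""
    counts = Counter(r["country"] for r in rows)
    return [r for r in rows if counts[r["country"]] >= k]

def completion_rate(rows):
    return sum(r["completed"] for r in rows) / len(rows)

original = completion_rate(records)                    # 2/4 = 0.5
deidentified = completion_rate(suppress_rare(records)) # 1/3 ≈ 0.33
```

The de-identified set protects the lone French student, but a researcher who receives only that set computes a completion rate a third lower than the true one, illustrating how verification against the original becomes impossible.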
Angiuli, O., Blitzstein, J., & Waldo, J. (2015). How to de-identify your data. Queue, 13(8), 20–39. https://doi.org/10.1145/2838344.2838930