Why it’s time to reexamine the binary dichotomy between personal and aggregate data, and what aggregation, synthesization and anonymization mean for the future of data privacy.
Privacy-enhancing technologies are typically used to transform data to enable sharing of private data with third parties. However, not all techniques reduce re-identification risks; the ones that do are not equal in strength, and the mere aggregation of data does not rank high on the de-identification spectrum – at least not without using additional controls.
This is the central argument of a research paper I produced with my colleague Alfred Rossi, “Aggregation, Synthesization and Anonymization: A Call for a Risk-Based Assessment of Anonymization Approaches,” which has been accepted for publication in the forthcoming book of the Computers, Privacy and Data Protection (CPDP) conference. Backed by 20 international academic centers of excellence, including centers in the EU, the US and beyond, CPDP is a global multidisciplinary conference that “offers the cutting edge in legal, regulatory, academic and technological development in privacy and data protection.” Our research challenges the common misconception that aggregate data is anonymized data and introduces an innovative risk-based framework to assess anonymization approaches, which has the potential to be replicable as long as assumptions related to attack methodologies hold.
Understanding the Data Environment
To understand why a data transformation process doesn’t necessarily lead to anonymized data, we must look beyond the output data and inspect the technique as applied in its wider environment. Doing so brings to light hidden assumptions about an adversary’s external and/or prior knowledge. We argue and demonstrate that only in this more complete picture – one where technical and organizational controls are analyzed with respect to a knowledgeable adversary – can risk be correctly evaluated.
However, even multi-billion dollar companies can overlook this critical assessment. The lack of principled de-identification methods has arguably led to scandals like AOL’s inadvertent release of search terms and Netflix’s unauthorized sharing of movie recommendations. Further, it is tempting to believe that if all of the data is combined and presented as, say, summary statistics, then it has been rendered safe for exchange or publication. The idea that aggregate data is somehow automatically safe is as pervasive and tempting as it is wrong. After all, would the United States Bureau of the Census be looking to employ differential privacy if aggregation were sufficient?
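To make the point about aggregates concrete, here is a minimal sketch of a differencing attack. It is a toy illustration rather than an example taken from our paper, and the names and salary figures are invented. It assumes an adversary who can obtain two sums that differ by exactly one person.

```python
# Hypothetical salary records; all names and values are invented for illustration.
salaries = {"alice": 52000, "bob": 61000, "carol": 58000, "dave": 75000}


def total_salary(records, exclude=None):
    """Return the sum over all records, optionally leaving one person out."""
    return sum(value for name, value in records.items() if name != exclude)


# Two aggregate releases that each look harmless on their own.
total_all = total_salary(salaries)
total_without_dave = total_salary(salaries, exclude="dave")

# An adversary who sees both releases subtracts one from the other
# and recovers a single individual's value exactly.
recovered = total_all - total_without_dave
print(recovered)
```

Neither sum contains a row about Dave, yet together they disclose his salary. This is why the aggregation step has to be assessed together with the controls around it, such as limits on which queries can be combined.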
Rethinking and Reassessing Anonymization
How can we assess and select anonymization solutions without having to reinvent the wheel for each type of dataset? Based on prior research, we developed a new analysis of anonymization controls. This is particularly useful in the context of data analytics and machine learning, in which models can remember input data, and works for decentralized techniques, like federated learning. The analysis can be applied to both traditional datasets and aggregate data products, such as summary statistics and generative models, which are typically used to produce synthetic data.
This analysis should inform the writing and interpretation of privacy and data protection requirements, which could have long-range effects on organizations’ data governance practices. For example, the California Consumer Privacy Act (CCPA) distinguishes between two categories of non-personal information: de-identified information and aggregate information. The distinction seems based on the assumption that aggregate information is always higher on the de-identification spectrum than de-identified information, implying that it is (much) safer with respect to re-identification risks. That assumption is technically unsound.
A similar assumption underlies the GDPR, though this instance appears more restrictive. GDPR Recital 162, which deals with certain types of processing activities, clearly specifies that “[t]he statistical purpose implies that the result of processing for statistical purposes is not personal data (including pseudonymised data), but aggregate data” (GDPR, Recital 162). In short, the GDPR also draws a distinction between personal data and aggregates, albeit in a non-binding way. However, technically speaking, this specification does not hold true in all cases.
How do we reconcile technical insight and legal interpretation? In order to effectively incentivize best practices in data governance, we should start by taking a nuanced look at both the CCPA and the GDPR. While the definition of aggregate information under the CCPA does not expressly require a combination of technical and organizational controls, the regulatory goal is that at the end of the aggregation process, data outputs are not directly or reasonably linkable to any consumer or household. Yet, this can only be achieved if, on top of the aggregation process itself, a combination of technical and organizational measures is implemented to transform the data and control the data environment. With regard to the GDPR, we suggest that the exclusion of aggregates from the remit of the regulation should not be systematic.
Redefining the Personal and Anonymized Data Relationship
Creating a uniform approach to assessing anonymization solutions’ strengths by considering both technical and organizational controls makes it possible to compare data transformation techniques, including synthesization, and characterize their outputs. Why? Even if synthetic data is considered a valid alternative to original data, at the end of the day synthetic data is data sampled from a model derived from aggregate data. Given sufficient data, it becomes possible to approximate the model parameters, which are themselves aggregates of the original data, thereby extending attacks on aggregates of the original data to synthetic data. Additionally, a risk-based approach provides an objective measure of anonymization approaches and has the potential to be replicable, as long as assumptions related to attack methodologies hold.
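The link between synthetic data and aggregates can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the method from our paper: the "generative model" is a single Gaussian whose two parameters (mean and standard deviation) are aggregates of an invented original dataset, and the adversary sees only the synthetic samples.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical original dataset (invented values, one number per person).
original = [random.gauss(50.0, 10.0) for _ in range(200)]

# A toy generative model: its parameters are aggregates of the original data.
model_mean = statistics.fmean(original)
model_std = statistics.stdev(original)


def sample_synthetic(n):
    """Draw n synthetic records from the fitted Gaussian model."""
    return [random.gauss(model_mean, model_std) for _ in range(n)]


# The adversary only ever sees synthetic records...
synthetic = sample_synthetic(100_000)

# ...but with enough of them can estimate the model parameters,
# which are themselves aggregates of the original data.
estimated_mean = statistics.fmean(synthetic)
estimated_std = statistics.stdev(synthetic)

print(abs(estimated_mean - model_mean), abs(estimated_std - model_std))
```

As the number of synthetic records grows, the estimates converge on the model parameters. Any attack that works against those aggregates, such as the differencing of two sums, can then be mounted against the synthetic data as well.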
Ultimately, it should emerge that the binary dichotomy between personal and anonymized data is misleading for two primary reasons:
- Simply looking at the data is insufficient to legally characterize it. Data context is key to protecting data and assessing the strength of its anonymization.
- The potential for inferences should drive the anonymization approach, rather than actual inferences. Taking a proactive approach when forming privacy and data protection requirements will help form comprehensive policy that ensures sensitive information doesn’t slip through the cracks.
Recognizing and actively solving for the opportunities that exist within anonymization techniques will help protect personal data across industries and enhance overall data security.
To learn more about our full report, “Aggregation, Synthesization and Anonymization: A Call for a Risk-Based Assessment of Anonymization Approaches,” get in touch with us at email@example.com or @ImmutaData on Twitter.