A Call for a Risk-Based Assessment of Anonymization Approaches

SOPHIE STALLA-BOURDILLON

Published August 24, 2020

Last edited: April 3, 2025

Why it’s time to reexamine the binary dichotomy between personal and aggregate data, and what aggregation, synthesisation and anonymisation mean for the future of data privacy.

Privacy-enhancing technologies are typically used to transform data to enable sharing of private data with third parties. However, not all techniques reduce re-identification risks; the ones that do are not equal in strength, and the mere aggregation of data does not rank high on the de-identification spectrum – at least not without using additional controls.

This is the central argument of a research paper I produced with my colleague Alfred Rossi, “Aggregation, Synthesisation and Anonymisation: A Call for a Risk-Based Assessment of Anonymisation Approaches,” which has been published in the Computers, Privacy and Data Protection (CPDP) conference book, Data Protection and Privacy: Data Protection and Artificial Intelligence. Backed by 20 international academic centers of excellence, including centers in the EU, the US, and beyond, CPDP is a global multidisciplinary conference that “offers the cutting edge in legal, regulatory, academic, and technological development in privacy and data protection.” Our research challenges the common misconception that aggregate data is anonymised data. It also introduces a risk-based framework for assessing anonymisation approaches, one that has the potential to be replicable as long as assumptions related to attack methodologies hold.

Understanding the Data Environment

To understand why a data transformation process doesn’t necessarily lead to anonymised data, we must look beyond the output data and inspect the technique as applied in its wider environment. Doing so brings to light hidden assumptions about an adversary’s external and/or prior knowledge. We argue and demonstrate that only in this more complete picture – one where technical and organisational controls are analysed with respect to a knowledgeable adversary – can risk be correctly evaluated.

However, even multi-billion dollar companies can overlook this critical assessment. The lack of principled de-identification methods has arguably led to scandals like AOL’s inadvertent release of search terms and Netflix’s unauthorised sharing of movie recommendations. Further, it is tempting to believe that if all of the data is combined together and presented as, say, summary statistics, then it has been rendered safe for exchange or publication. The idea that aggregate data is somehow automatically safe is as pervasive and tempting as it is wrong. After all, would the United States Bureau of the Census be looking to employ differential privacy if aggregation were sufficient?
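To make the point about summary statistics concrete, here is a minimal sketch of a so-called differencing attack. It is an illustration only, not an example taken from the paper: the names, salary figures, and the `release` and `infer_newcomer_value` helpers are all hypothetical. It assumes an adversary who sees two aggregate releases and knows, from outside the data, that exactly one person joined between them.

```python
# Hypothetical records; names and figures are invented for illustration.
before = {"alice": 52000, "bob": 61000, "carol": 58000}
after = dict(before, dave=75000)  # one person joins between the two releases


def release(records):
    """Publish only aggregates: a headcount and a salary total."""
    return {"count": len(records), "total": sum(records.values())}


def infer_newcomer_value(r_before, r_after):
    """Recover the newcomer's exact value from two aggregate releases.

    Assumes the adversary knows (external knowledge) that the only change
    between the releases is a single person joining.
    """
    if r_after["count"] - r_before["count"] != 1:
        raise ValueError("attack as sketched needs exactly one newcomer")
    return r_after["total"] - r_before["total"]


r1 = release(before)
r2 = release(after)
inferred = infer_newcomer_value(r1, r2)
print(inferred)  # 75000
```

Neither release contains a row about any individual, yet subtracting one from the other yields a single person's salary. Whether this works depends entirely on what the adversary already knows and on how many releases they can see, which is why the output data alone cannot tell us how risky it is.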

Rethinking and Reassessing Anonymisation

How can we assess and select anonymisation solutions without having to reinvent the wheel for each type of dataset? Based on prior research, we developed a new analysis of anonymisation controls. This is particularly useful in the context of data analytics and machine learning, in which models can remember input data, and works for decentralised techniques, like federated learning. The analysis can be applied to both traditional datasets and aggregate data products, such as summary statistics and generative models, which are typically used to produce synthetic data.

This analysis should inform the writing and interpretation of privacy and data protection requirements, which could have long-range effects on organizations’ data governance practices. For example, the California Consumer Privacy Act (CCPA) distinguishes between two categories of non-personal information: de-identified information and aggregate information. The distinction seems based on the assumption that aggregate information is always higher on the de-identification spectrum than de-identified information, and is therefore (much) safer with respect to re-identification risks. That assumption is technically unsound.

A similar assumption underlies the GDPR, though this instance appears more restrictive. GDPR Recital 162, which deals with certain types of processing activities, clearly specifies that “[t]he statistical purpose implies that the result of processing for statistical purposes is not personal data (including pseudonymised data), but aggregate data” (GDPR, Recital 162). In short, the GDPR also draws a distinction between personal data and aggregates, albeit in a non-binding way. However, technically speaking, this specification does not hold true in all cases.

How do we reconcile technical insight and legal interpretation? In order to effectively incentivize best practices in data governance, we should start by taking a nuanced look at both the CCPA and the GDPR. While the definition of aggregate information under the CCPA does not expressly require a combination of technical and organizational controls, the regulatory goal is that at the end of the aggregation process, data outputs are not directly or reasonably linkable to any consumer or household. Yet, this can only be achieved if, on top of the aggregation process itself, a combination of technical and organizational measures is implemented to transform the data and control the data environment. With regard to the GDPR, we suggest that the exclusion of aggregates from the remit of the regulation should not be systematic.

Redefining the Personal and Anonymized Data Relationship

Creating a uniform approach to assessing anonymisation solutions’ strengths by considering both technical and organisational controls makes it possible to compare data transformation techniques, including synthesisation, and characterize their outputs. Why? Even if synthetic data is considered a valid alternative to original data, at the end of the day synthetic data is data sampled from a model derived from aggregate data. Given sufficient data, it becomes possible to approximate the model parameters, which are themselves aggregates of the original data, thereby extending attacks on aggregates of the original data to synthetic data. Additionally, a risk-based approach provides an objective measure of anonymisation approaches and has the potential to be replicable, as long as assumptions related to attack methodologies hold.
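The link between synthetic data and aggregates can be sketched in a few lines. The example below is a toy illustration under loud assumptions, not the method from the paper: the "generative model" is just a normal distribution whose two parameters (`mu`, `sigma`) are fitted to invented original values, and the `synthesise` helper is hypothetical. It shows that someone who only sees enough synthetic samples can estimate those parameters, which are themselves aggregates of the original data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Invented "original" data standing in for a sensitive numeric attribute.
original = [rng.gauss(50.0, 10.0) for _ in range(200)]

# The model parameters are aggregates (mean, standard deviation) of the original.
mu = statistics.fmean(original)
sigma = statistics.stdev(original)


def synthesise(n, rng):
    """Draw n synthetic values from the fitted model (a toy normal model)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


# An observer who sees only the synthetic data re-estimates the parameters.
synthetic = synthesise(50_000, rng)
mu_hat = statistics.fmean(synthetic)
sigma_hat = statistics.stdev(synthetic)

print(f"fitted on original:   mean={mu:.2f}, stdev={sigma:.2f}")
print(f"recovered from synth: mean={mu_hat:.2f}, stdev={sigma_hat:.2f}")
```

Once the aggregates have been approximated in this way, any attack that works against those aggregates, such as comparing them across releases, can be pointed at the synthetic data as well. Real generative models have far more parameters, but the reasoning is the same.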

Ultimately, it should emerge that the binary dichotomy between personal and anonymised data is misleading for two primary reasons:

Simply looking at the data is insufficient to legally characterise it. Data context is key to protecting data and assessing the strength of its anonymisation.
The potential for inferences should drive the anonymisation approach, rather than actual inferences. Taking a proactive approach when forming privacy and data protection requirements will help form comprehensive policy that ensures sensitive information doesn’t slip through the cracks.

Recognizing and actively solving for the opportunities that exist within anonymisation techniques will help protect personal data across industries and enhance overall data security.

To read our full report, “Aggregation, Synthesisation and Anonymisation: A Call for a Risk-Based Assessment of Anonymisation Approaches,” click here. Questions? Get in touch with us at [email protected] or @ImmutaData on Twitter.
