Why Clarifying De-Identification Concepts is Key to Sufficient Data Protection

Data protection law emerged in the 1970s in Europe as a means to protect against the risks posed by automated or computer-based data processing. As a concept, it thus goes far beyond protecting individuals against the disclosure of nonpublic information, a concern that is still very much at the center of modern US privacy laws, such as the California Consumer Privacy Act (CCPA) or its second iteration, the California Privacy Rights Act (CPRA). Privacy laws around the world are progressively being informed by the data protection approach and evolving in response to it. Yet, as time passes, the patchwork of data protection and privacy laws is becoming more complex. There are dozens of new and existing regulations across the world, and each regulation uses different terminology.

This complexity is particularly problematic when trying to delineate the scope of these laws and the range of obligations imposed upon organizations processing personal information or personal data. 

Let’s take the confusion that usually surrounds the notions of de-identification, pseudonymization, and anonymization. Is it possible to process de-identified, pseudonymized, or anonymized personal information or personal data for data analytics purposes? Depending on whom you ask, you are likely to get a different answer and a different definition of de-identification, pseudonymization, and anonymization.

Our new white paper suggests that by analyzing the structure of European data protection law, it is possible to extract a data protection grammar, which can facilitate the assessment of similarities and dissimilarities among various frameworks. This is accomplished by taking the European Union General Data Protection Regulation (GDPR) as the most illustrative example of Europe’s data protection laws, and comparing it with the structure underlying modern data protection or privacy laws adopted in other jurisdictions such as CPRA, the Canadian Personal Information Protection and Electronic Documents Act (PIPEDA), the Brazilian General Personal Data Protection Law (GPDPL), and others. To be clear, while a data protection grammar is useful for comparison purposes, it does not imply that similar problems get similar answers across jurisdictions. 

The basic postulate of this data protection grammar is that both core structural rules (i.e., a data protection syntax) and a concise set of lexical items are embedded within the new generation of data protection and privacy laws. This grammar therefore proves particularly useful in comparing the scope and effects of these frameworks, and assessing the capabilities of emerging policy layers built to govern multiple cloud data platforms and meet geographic demands.  

In this blog post, we’ll illustrate how a rigorous lexicon can help clarify concepts such as de-identification, pseudonymization, and anonymization.

What is de-identification?  

Let’s start with a basic observation: identifiability attracts protection. Identifiability implies that a link can be established between the data and an individual. 

On the left-hand side of the identifiability spectrum is the absolutist approach: All data that can be linked to an individual, either directly or indirectly, is characterized as personal information or data. On the right-hand side is the relativist approach: Only data that can reasonably be linked to an individual, either directly or indirectly, is characterized as personal information or data.

We posit that at a high level, de-identification means breaking the link between the data and the individual. When the data has undergone a successful process of de-identification, its use and disclosure are made easier. 


Importantly, when the data is considered de-identified, data custodians and/or data recipients are not necessarily relieved of all obligations. In many cases, if the data remains within a closed environment (i.e., is not made publicly available), data recipients will be subject to a series of process firewalls and obligations. Hence, as a rule of thumb, de-identification only weakens the legal protection of the data; it does not remove it.

So, how can we determine that the link between the data and the individual can be considered broken? 

At a high level, the link should be considered broken when personal identifiers have been removed. The challenge is that there are two types of personal identifiers to deal with: direct identifiers and indirect identifiers. 

What are personal identifiers?

Personal identifiers are attribute values that can be used to discriminate among individuals (i.e., can be used to help locate an individual’s records or single them out) and are considered to be available to an attacker. An attribute is available to an attacker when it is publicly available, observable, or attainable. The two main characteristics of personal identifiers are thus distinguishability and availability.

Direct personal identifiers are attribute values that are unique to an individual and are considered to be available to an attacker (such as Social Security number, passport or ID number, or credit card number). 

Indirect personal identifiers are attribute values (such as height, ethnicity, or hair color) that are available to an attacker and, while not unique to an individual, can be used in combination with other attributes to distinguish an individual.

What are attack models used for? 

An attack model is a collection of assumptions and constraints on the data environment, and/or the behavior and background knowledge of a potential attacker. In practice, attack models are used to assess the reasonableness of the link established between the data and the individual, and thereby to define the availability of the data to an attacker and, ultimately, to identify the presence of personal identifiers within the data source.

It is crucial to note that detecting identifiers within a data source requires acknowledging information that is not included within the data set. This may include publicly available or observable information, or more generally, information that is available to an attacker. 
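
To make this concrete, here is a minimal Python sketch of a linkage attack. Everything in it is hypothetical: the records, the attribute names (zip, birth_year, sex), and the link helper are invented for illustration, and the assumption is that the attacker holds a public register that shares those attributes with the released data.

```python
# Hypothetical released data: direct identifiers have already been removed.
released = [
    {"zip": "02139", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02141", "birth_year": 1982, "sex": "M", "diagnosis": "flu"},
]

# Hypothetical information available to the attacker (e.g., a public register).
public_register = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "02140", "birth_year": 1990, "sex": "F"},
]

# Attributes assumed to appear in both sources.
SHARED = ("zip", "birth_year", "sex")


def link(released_rows, external_rows, shared=SHARED):
    """Return (name, released_row) pairs where the shared attributes
    match exactly one released row."""
    matches = []
    for person in external_rows:
        key = tuple(person[a] for a in shared)
        hits = [r for r in released_rows if tuple(r[a] for a in shared) == key]
        if len(hits) == 1:
            matches.append((person["name"], hits[0]))
    return matches


reidentified = link(released, public_register)
for name, row in reidentified:
    print(name, "->", row["diagnosis"])
```

None of the three shared attributes is identifying when the released data is inspected in isolation; they become indirect identifiers only once the outside register is taken into account.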

How do data protection and privacy frameworks differ? 

The toolset to de-identify data is relatively well established in practice and relies upon a variety of controls. Controls are organizational, legal, or technical measures put in place to reduce re-identification risks. Data controls affect the visibility of the data and include the familiar techniques of tokenization, k-anonymization, and local and global differential privacy. Context controls, on the other hand, affect the data’s environment and include access controls and user segmentation, contracts, training, monitoring, and auditing. Combining data controls and context controls is an effective way to significantly reduce re-identification risks while preserving some level of utility. 
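
As an illustration of a data control, the following Python sketch generalizes two indirect identifiers and measures the effect on k, the size of the smallest group of records sharing the same values. The records, the choice of age and zip as quasi-identifiers, and the generalization rules are assumptions made for this example, not a recommended configuration.

```python
from collections import Counter

# Attributes assumed to be available to an attacker.
QUASI_IDENTIFIERS = ("age", "zip")


def k_of(rows, quasi=QUASI_IDENTIFIERS):
    """Smallest group size over the quasi-identifier combination
    (the 'k' in k-anonymity)."""
    groups = Counter(tuple(r[a] for a in quasi) for r in rows)
    return min(groups.values())


def generalize(row):
    """Data control: coarsen age to a 10-year band and keep only the
    first three digits of the ZIP code."""
    low = (row["age"] // 10) * 10
    return {**row, "age": f"{low}-{low + 9}", "zip": row["zip"][:3] + "**"}


rows = [
    {"age": 34, "zip": "02139", "condition": "asthma"},
    {"age": 36, "zip": "02138", "condition": "flu"},
    {"age": 41, "zip": "02141", "condition": "diabetes"},
    {"age": 47, "zip": "02142", "condition": "asthma"},
]

generalized = [generalize(r) for r in rows]
print(k_of(rows))         # k before generalization
print(k_of(generalized))  # k after generalization
```

In this toy data set every record starts out unique on (age, zip), and after generalization each record shares its values with one other record. The price is utility: exact ages and full ZIP codes are no longer available for analysis, which is why data controls are usually paired with context controls rather than pushed to the extreme.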

One key difference between data protection and privacy frameworks lies in the way the reasonableness standard that affects the strength of the link between the data and the individual is interpreted. This has direct implications for the range of controls to combine in order to claim that the data should be considered de-identified or anonymized within the meaning of the applicable law. 

As a matter of principle, it’s possible to say that anonymization is the strongest type of de-identification process, in that it requires considering all situationally-relevant potential attackers to determine whether the data is anonymized. It’s worth noting that some regulatory frameworks appear to use the two terms synonymously; in our view, however, it makes more sense to think of anonymization as a stronger form of de-identification, which requires more than transforming the anticipated data recipient into a trusted recipient to reduce the range of available attribute values. 

De-identification under HIPAA

Let’s explain this in context and turn to the US Health Insurance Portability and Accountability Act (HIPAA). 45 CFR § 164.514(b) provides that, for a data custodian to claim data is de-identified, a de-identification expert should, applying accepted statistical and scientific principles, “[determine] that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information.” This HIPAA de-identification test is therefore only concerned with the re-identification means of the anticipated recipient.

Anonymization under GDPR

Now, let’s turn to GDPR. Recital 26 specifies that “all the means reasonably likely to be used, such as singling out, either by the controller or by another person, to identify the natural person directly or indirectly” should be taken into account to determine whether an individual remains identifiable. There is no restriction with regard to who this other person attempting re-identification is. In practice, the Prosecutor Attack Model should be useful in this context, as it assumes the attacker has access to the data source and knows the complete set of publicly or otherwise reasonably attainable information about their target, including information that is realistically attainable even if it is not readily available. Of note, the introduction in Recital 26 of the expression ‘singling out,’ i.e., locating an individual’s record within a data set, should imply distinguishability of a record within a data set on the available attributes. That is, both distinguishability and availability should be necessary to single out.
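
The interplay between distinguishability and availability can be sketched in a few lines of Python. The sketch uses one common convention for the prosecutor model, which is an assumption here rather than something Recital 26 prescribes: a record's risk is 1/f, where f is the number of records sharing its values on the attributes assumed to be available to the attacker. The records and attribute names are hypothetical.

```python
from collections import Counter


def prosecutor_risk(rows, available):
    """Per-record risk 1/f, where f is the number of rows sharing the
    record's values on the attributes available to the attacker."""
    keys = [tuple(r[a] for a in available) for r in rows]
    sizes = Counter(keys)
    return [1 / sizes[k] for k in keys]


rows = [
    {"height_cm": 180, "hair": "brown", "salary": 50000},
    {"height_cm": 180, "hair": "brown", "salary": 62000},
    {"height_cm": 165, "hair": "red", "salary": 58000},
]

# Assumption: height and hair color are observable; salary is not.
risks = prosecutor_risk(rows, available=("height_cm", "hair"))

# A record is singled out when no other record shares its available values.
singled_out = [r for r, risk in zip(rows, risks) if risk == 1.0]
print(risks)
print(len(singled_out))
```

Salary distinguishes all three records, yet it does not contribute to the risk because it is assumed to be unavailable to the attacker. Only the third record is both distinguishable and reachable through available attributes, so only that record is singled out.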

GDPR introduces the notion of pseudonymization in Article 4(5). Recital 26 confirms that pseudonymization is not enough to achieve anonymization. To make sense of this definition and distinguish pseudonymization from anonymization, one must assume that pseudonymization does not require acknowledging information that is not included within the data set to determine whether the data can be attributed to an individual. Consequently, pseudonymization is not concerned with the treatment of indirect identifiers. Recall that indirect identifiers are only identifying to the extent there is access to information that is not included within the data set, in particular publicly available information that can be matched with attribute values within the data set.

What pseudonymization could thus refer to in practice is tokenization of direct identifiers combined with key segregation. Tokenization is a specific form of data masking where the replacement value, also called a “token,” has no extrinsic meaning to an attacker. Key segregation means that the key used to generate the token is separated from the pseudonymized data through process firewalls. The token is a new value that is meaningless in other contexts. Further, it is not feasible for an attacker to make inferences about the original data from analysis of the token value. 
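
A minimal Python sketch of this idea follows, using a keyed HMAC-SHA256 from the standard library as the tokenization function. This is one possible technique chosen for illustration, not a description of any particular product; the helper names (make_key, tokenize) and the sample records are hypothetical, and key segregation is only indicated by a comment because it is an organizational measure rather than something code alone can enforce.

```python
import hashlib
import hmac
import secrets


def make_key():
    """Generate a random secret key. Under key segregation it would be
    stored apart from the pseudonymized data."""
    return secrets.token_bytes(32)


def tokenize(value, key):
    """Replace a direct identifier with a keyed HMAC-SHA256 token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: the key is held by a separate custodian, not by data recipients.
key = make_key()

records = [
    {"ssn": "123-45-6789", "visits": 3},
    {"ssn": "987-65-4321", "visits": 1},
    {"ssn": "123-45-6789", "visits": 5},
]

# The direct identifier is dropped and replaced by its token.
pseudonymized = [
    {"token": tokenize(r["ssn"], key), "visits": r["visits"]} for r in records
]
```

The same input always yields the same token under a given key, so records belonging to one individual can still be joined for analytics. Using a secret key rather than a plain hash matters for low-entropy identifiers: without the key, an attacker could hash every possible Social Security number and match the results against the tokens.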

So what?

To recap, the detection of both direct and indirect identifiers is key to de-identification and anonymization processes. What counts as an identifier depends in turn on the specific attack model applied and, in particular, on assumptions related to the range of attackers and the means reasonably likely to be used by these attackers. In addition, pseudonymization is weaker than both de-identification and anonymization.

Now it becomes possible to identify the common denominator among different data protection frameworks and enforce rules for a variety of frameworks. Data teams, for example, could be asked to detect both direct and indirect identifiers to de-identify or anonymize the data, depending upon the attack model chosen, or to mask direct identifiers with hashing to achieve pseudonymization.

Curious about how to create data protection rules within Immuta? See for yourself by requesting a demo.
