How k-Anonymization is Making Health Data More Secure

As officials race to get a handle on COVID-19, the importance of collecting and sharing health data has never been more clear. The virus’ unpredictable and widespread nature makes tracing and testing critical to understanding transmission and treatment. Yet, this also poses a threat to personal health data: with data being collected in as many as 500 cities, millions of individual records are at stake. 

According to HIPAA, datasets meet compliance standards if there is a “very small” risk that individuals can be identified from the data. History has shown, however, that health data is easy to re-identify if not adequately protected. What measures can data architects and engineers implement to reduce the risk, particularly when the need is so urgent? 

k-Anonymization, when combined with other straightforward techniques, is an effective way to enhance health data security, and it is now more dynamic and easier to apply. Before sharing sensitive health data, consider how this advanced but low-lift approach can help defend against re-identification and speed up insights and action on COVID-19 and other health crises.

What is k-anonymization?

Put simply, k-anonymization reduces re-identification risk by generalizing or suppressing indirect identifiers, which removes the fine-grained detail that makes an individual stand out. It’s the data equivalent of hiding in a crowd: the more people (or in this case, data points) that are present and generally similar, the harder it is to pick out the details that can identify individuals.

K is the minimum number of rows that share any given combination of indirect identifier values; a dataset is k-anonymous when attributes within it are generalized or suppressed until each row is identical, across those indirect identifiers, to at least k-1 other rows. Therefore, the higher the value of k, the lower the re-identification risk. Just as a larger crowd makes it less likely you’ll find exactly the person you’re looking for, a larger dataset makes k-anonymization work particularly well. However, if there isn’t enough data to anonymize indirect identifiers, lines of data may have to be redacted, making them unusable. Researchers therefore walk a fine line between privacy and utility, but one that is necessary from a legal and ethical standpoint.
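To make the definition concrete, here is a minimal sketch of how k could be measured for a small table. The function name `k_of`, the column names and the records are all hypothetical, chosen only for illustration; the idea is simply to group rows by their indirect identifiers and take the size of the smallest group.

```python
from collections import Counter

def k_of(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share the
    same value in every indirect (quasi-) identifier column."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

# Hypothetical, already-generalized records (illustrative values only).
rows = [
    {"city": "Springfield", "age": "21-30", "diagnosis": "positive"},
    {"city": "Springfield", "age": "21-30", "diagnosis": "negative"},
    {"city": "Springfield", "age": "21-30", "diagnosis": "negative"},
    {"city": "Shelbyville", "age": ">=31", "diagnosis": "positive"},
    {"city": "Shelbyville", "age": ">=31", "diagnosis": "negative"},
]

# The diagnosis column is the value being studied, so it is not grouped on.
k = k_of(rows, ["city", "age"])
print(k)  # 2: the smallest group (Shelbyville, >=31) has two rows
```

In this sketch the table is 2-anonymous: every row matches at least one other row on city and age range.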

Why is k-anonymization critical when handling health data?

There are many different types of privacy enhancing techniques (PETs) that can be implemented to protect sensitive health data, so what makes k-anonymization particularly effective? Let’s look at k-anonymization using COVID-19 research as an example. 

Say researchers gather data that includes individuals’ names, locations, genders, ages and COVID-19 diagnoses — information that could help local health departments track cases, anticipate high risk individuals and allocate testing resources. Sharing that raw dataset would be a far cry from satisfying HIPAA’s “very small” re-identification risk, not to mention that it could also open the door for issues down the line such as insurance or rent discrimination. 

Even if names, a direct identifier, were masked, it’s easy to derive a patient’s identity from the remaining information. In fact, Harvard professor Latanya Sweeney found that an alarming 87% of the U.S. population can be uniquely identified based on just their birth date, gender and five-digit ZIP code. In theory, a health clinic employee who is familiar with each patient and has access to their basic personal information, like address and birthday, could reasonably match a patient with a positive or negative COVID-19 test. And, since we must always assume a highly motivated and/or knowledgeable attacker exists, it’s safe to call this a high-risk, high-feasibility situation.
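One rough way to gauge this exposure in your own data is to count how many rows are unique on their indirect identifiers once names are removed. The sketch below uses made-up records and a hypothetical helper, `unique_fraction`; it is not Sweeney's methodology, just a simple uniqueness check in the same spirit.

```python
from collections import Counter

def unique_fraction(rows, quasi_identifiers):
    """Share of rows whose combination of indirect identifiers appears
    exactly once, which makes them trivially linkable to a person."""
    keys = [tuple(row[col] for col in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys) if keys else 0.0

# Hypothetical records with names already masked (illustrative values only).
rows = [
    {"birth_date": "1990-04-12", "gender": "F", "zip": "02138"},
    {"birth_date": "1990-04-12", "gender": "F", "zip": "02138"},
    {"birth_date": "1985-11-02", "gender": "M", "zip": "02139"},
    {"birth_date": "1972-07-30", "gender": "F", "zip": "02139"},
]

share = unique_fraction(rows, ["birth_date", "gender", "zip"])
print(share)  # 0.5: two of the four rows are one of a kind
```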

The question then becomes, how do you anonymize the other data columns without losing the dataset’s integrity or usability? This is where it’s easy to see why health data requires a more comprehensive PET — but also one that won’t completely sacrifice utility for privacy. Let’s look at how k-anonymization strikes the balance.

How does k-anonymization work?

To illustrate how k-anonymization works in practice, we’ll draw on our COVID-19 “dataset.” Recall that researchers have patients’ names, locations, genders, ages and COVID-19 diagnoses. First, they must identify the most important data for their needs, which in this case is the patients’ diagnoses. The diagnoses are the sensitive values the research depends on, so they stay intact. The names are direct identifiers, but they aren’t critical to the research goal and therefore can be masked. As discussed, the remaining data (location, gender and age) are all indirect identifiers that could be combined with other data to reveal a patient’s identity, so they must be generalized or suppressed in order to mitigate the risk of re-identification.

A dataset’s risk level is dependent on its most at-risk individual, and that individual’s re-identification risk is 1/k, where k is the number of rows sharing the same indirect identifiers. That means if a single patient has a different location, gender or age than every other patient, there is a 100% risk of re-identification. Let’s say only three patients in the dataset are from a certain city, and only one of them is female; researchers can suppress gender so that those three rows become indistinguishable and k=3, reducing re-identification risk to 1/3, or about 33%. That is certainly a step in the right direction, but still well above the CMS standard of 9%.
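The arithmetic behind these percentages is simply 1/k for the smallest group. The sketch below works through a hypothetical three-row slice of a dataset, where one patient is unique until gender is suppressed. The helper names, the rows and the `"*"` placeholder are assumptions made for illustration; the 9% threshold is the CMS figure cited above.

```python
from collections import Counter

def max_risk(rows, quasi_identifiers):
    """Worst-case re-identification risk: 1 / (size of the smallest group)."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return 1 / min(groups.values())

def suppress(rows, column):
    """Return copies of the rows with one column replaced by a placeholder."""
    return [{**row, column: "*"} for row in rows]

# Hypothetical rows: three patients from one city, only one of them female.
rows = [
    {"city": "Springfield", "gender": "F"},
    {"city": "Springfield", "gender": "M"},
    {"city": "Springfield", "gender": "M"},
]
quasi_identifiers = ["city", "gender"]
THRESHOLD = 0.09  # the 9% standard cited in the article

before = max_risk(rows, quasi_identifiers)
after = max_risk(suppress(rows, "gender"), quasi_identifiers)

print(before)               # 1.0: the female patient is unique
print(round(after, 2))      # 0.33: all three rows now look alike, so k=3
print(after <= THRESHOLD)   # False: more generalization or more rows needed
```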

To further reduce re-identification risk, researchers can work with the remaining indirect identifiers of location and age. Imagine in our dataset that only one patient is 25 years old, while every other age is shared by several patients. In this case, researchers can generalize the data by showing age ranges (e.g., ≤20, 21-30, ≥31) so that the 25-year-old blends in with everyone else in the 21-30 range. Depending on the ages of the other patients, this level of anonymization may be sufficient, or researchers may also have to generalize or suppress locations.
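Generalizing ages into ranges can be sketched in a few lines. The ranges below mirror the ones mentioned above (written in ASCII as `<=20`, `21-30`, `>=31`); the list of ages and the `age_range` helper are hypothetical and exist only to show how the smallest group grows once exact ages are replaced by ranges.

```python
from collections import Counter

def age_range(age):
    """Map an exact age to one of three coarse ranges."""
    if age <= 20:
        return "<=20"
    if age <= 30:
        return "21-30"
    return ">=31"

# Hypothetical patient ages (illustrative values only).
ages = [19, 22, 25, 28, 34, 34, 41, 18]

exact_groups = Counter(ages)
range_groups = Counter(age_range(age) for age in ages)

k_before = min(exact_groups.values())
k_after = min(range_groups.values())

print(k_before)  # 1: the 25-year-old (among others) is unique on exact age
print(k_after)   # 2: every range now contains at least two patients
```

The trade-off is visible here too: the ranges protect the unique patients, but an analyst can no longer distinguish a 22-year-old from a 28-year-old.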

If the re-identification probability is still deemed too high once all anonymization options have been exhausted, researchers’ final options are to add more rows or collect informed consent from the patients; otherwise, they can’t legally or ethically share the dataset. It quickly becomes easy to see why k-anonymization works best on larger datasets — but also how in the right data environment, it provides a high level of privacy and utility. When combined with other PETs, like randomized response and sampling, it can help protect data from collection to analysis. 
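For readers unfamiliar with randomized response, the sketch below shows one common textbook variant, not a description of any particular product's implementation. Each respondent reports the truth with probability `p_truth` and otherwise reports a fair coin flip, so no single answer can be taken at face value, yet the overall rate can still be estimated. The function names, the 0.5 probability and the simulated population are assumptions made for illustration.

```python
import random

def randomized_response(truth, p_truth, rng):
    """Report the true answer with probability p_truth;
    otherwise report the outcome of a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(reports, p_truth):
    """Invert observed = p_truth * true_rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

# Simulated population: 30% of 10,000 respondents have a positive diagnosis.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
truths = [True] * 3000 + [False] * 7000
reports = [randomized_response(truth, 0.5, rng) for truth in truths]

estimate = estimate_true_rate(reports, 0.5)
print(round(estimate, 2))  # close to the true rate of 0.30
```

Because the noise is added at collection time, this kind of technique complements k-anonymization, which is applied to the dataset afterward.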

Why is now the time for k-anonymization?

k-Anonymization is nothing new, but its practicality is. Custom coding and ETL processes have historically slowed the technique’s time-to-data and limited its scalability. Immuta has made it easier to implement with a built-in feature that applies k-anonymization on the fly, based on a simple policy that can be enforced on any database across your organization, which eliminates the manual, code-based or ETL approaches. This enables you to rapidly derive value from any sensitive dataset with fewer legal and privacy concerns and without creating data copies.

The rapid community spread of COVID-19 has laid bare how critical it is to share health data as quickly as possible in order to predict outbreaks, conduct contact tracing and route critical medical supplies and PPE. However, privacy can’t be sacrificed for speed, and without utility, data-driven decisions can’t be made. k-Anonymization can help transform, analyze and share secure data at scale, making it an important privacy enhancing technique, not just now, but well into the future.

Don’t wait to start implementing PETs to protect your data. Learn how with our guide to Reducing Re-Identification Risk in Health Data.