The data your company collects is one of its most valuable assets. And yet, at a time when data breaches regularly make headline news, the value of that data hinges on your ability not only to derive meaningful insights from it, but also to keep it safe. In fact, with data privacy regulations like California's CCPA (soon to be the CPRA) and Europe's GDPR, not to mention your company's own increasingly stringent internal data rules and security policies, data security has become a top priority for every business, particularly when personally identifiable information (PII) or protected health information (PHI) is involved.
Effectively, this means that as a data engineer or architect, you have to strike the right balance between giving your company's stakeholders access to data at the speed they have come to expect and taking all of the steps necessary to preserve your data's privacy and utility. Building privacy protections with mathematical guarantees against re-identification can otherwise require a research team of PhDs in mathematics. Immuta helps Databricks teams solve this challenge with the skills they already have. This article explores three popular privacy-enhancing capabilities; when you access them through Immuta's native integration, you will always have the controls you need to reduce re-identification risk as you share data in new products and services.
Let’s take a look at each of these controls in turn:
1. Differential Privacy
Organizations are collecting more data now than ever before. While that has fueled analytic capabilities, the risk is that no single piece of data exists in a vacuum: seemingly innocuous data points can be combined to re-identify individuals. To illustrate this, consider the US Census Bureau, which collects demographic information, including PII, to help apportion seats in the House of Representatives, draw legislative districts, and allocate hundreds of billions of dollars in federal funds. However, the Bureau faces the challenge of making millions of Americans' data available and shareable for analysis and use in Databricks with Apache Spark, while also protecting respondents' privacy. That's where differential privacy comes in.
Differential privacy is a privacy-preserving strategy that aims to mathematically limit an outsider’s ability to confidently use the output of an analysis to make inferences about its input. That, in turn, allows those individuals providing their personal data to credibly deny their participation in the input.
Differential privacy requires the data analysis mechanism to produce the same answers with similar probabilities over any pair of databases that differ by a single row. Simply put, Immuta injects carefully calibrated noise into the results of an analysis, making it statistically unreliable for an attacker to infer whether any individual's data was included.
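Formally, a mechanism satisfies epsilon-differential privacy if, for any pair of databases differing by a single row, the probability of producing any given output changes by at most a factor of e^epsilon; smaller values of epsilon mean stronger privacy. To make the idea concrete, here is a minimal Python sketch of the classic Laplace mechanism for a counting query. The function names, parameters, and data are illustrative assumptions only; this is not Immuta's implementation, which applies calibrated noise for you.

```python
import numpy as np

def laplace_count(values, predicate, epsilon=1.0):
    # A counting query has sensitivity 1: adding or removing one row
    # changes the true count by at most 1, so Laplace noise with
    # scale 1/epsilon yields an epsilon-differentially private count.
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: how many respondents are over 65?
ages = [34, 71, 52, 68, 77, 45, 80, 61]
print(laplace_count(ages, lambda age: age > 65, epsilon=0.5))
```

The reported count stays close enough to the truth to be analytically useful, but is noisy enough that no one can tell whether any particular respondent's row was in the input.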
2. Randomized Response
While differential privacy makes it possible for people to credibly deny their participation in a data input, randomized response (or local differential privacy) makes it possible for participating individuals to credibly deny the contents of their participation records. In other words, this approach allows people to share sensitive or even potentially embarrassing data, including PII and PHI, confidentially.
Like differential privacy, randomized response uses randomization to enhance privacy. However, unlike centralized differential privacy, it is enforced before the data is ever submitted, and formal constraints are applied to the randomized substitutions: any reported value must be nearly (though not necessarily exactly) as likely to arise from any given input. As a result, all potential inputs look plausible to an attacker wishing to undo the randomized substitution.
Since the randomized response technique is applied before the data leaves a device, data subjects are protected from the moment of submission, and their data remains privatized even in the event of a later breach.
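To see how the mechanics work, here is a minimal sketch of the classic coin-flip randomized response scheme for a sensitive yes/no question. The names and numbers are illustrative assumptions, not Immuta's implementation.

```python
import random

def randomized_response(truth: bool) -> bool:
    # With probability 1/2, answer truthfully; otherwise answer with a
    # fresh fair coin flip. Any single reported answer is plausibly
    # deniable, since it could have arisen from either true value.
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_true_rate(responses):
    # The expected reported 'yes' rate is 0.5 * p + 0.25, where p is the
    # true rate, so p can be recovered as 2 * reported_rate - 0.5.
    reported_rate = sum(responses) / len(responses)
    return 2 * reported_rate - 0.5

# Simulated survey: 10,000 respondents with a true 'yes' rate of 30%
responses = [randomized_response(random.random() < 0.3) for _ in range(10_000)]
print(estimate_true_rate(responses))  # close to 0.30
```

No individual answer can be trusted, yet the aggregate statistic remains accurate, which is exactly the tradeoff randomized response is designed to make.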
3. k-Anonymization
k-Anonymization is the data equivalent of playing Where's Waldo? The more cartoon characters, or data points, that generally look similar, the harder it is to pick out the one you are looking for. k-Anonymization makes it more difficult for attackers to find details that can help identify individuals in your data set. This approach reduces re-identification risk by generalizing indirect identifiers without destroying the data's signal.
In k-anonymization, k represents the number of rows that share a given combination of indirect-identifier values. A data set is k-anonymous when attributes within it are generalized or suppressed until each row is identical to at least k-1 other rows; the higher the value of k, the lower the re-identification risk. Just as a larger crowd makes it harder to find the exact person you are looking for, k-anonymization works particularly well with large data sets. Rows may have to be redacted entirely if there isn't enough data to anonymize the indirect identifiers, as the sketch below shows.
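Here is a minimal pandas sketch of that process: generalize an indirect identifier, then suppress any rows whose quasi-identifier group is still too small. The data, column names, and value of K are illustrative assumptions; Immuta performs the equivalent work automatically.

```python
import pandas as pd

K = 3  # every quasi-identifier combination must appear at least K times

df = pd.DataFrame({
    "age":    [23, 27, 25, 41, 44, 47, 62, 38],
    "state":  ["CA", "CA", "CA", "NY", "NY", "NY", "TX", "NY"],
    "salary": [72000, 81000, 65000, 95000, 88000, 91000, 70000, 99000],
})

# Step 1: generalize an indirect identifier (exact age -> 10-year band)
df["age"] = (df["age"] // 10 * 10).astype(str) + "s"

# Step 2: suppress rows whose (age, state) group has fewer than K members
group_sizes = df.groupby(["age", "state"])["salary"].transform("size")
anonymized = df[group_sizes >= K]
print(anonymized)
```

After these two steps, every remaining row is identical to at least K-1 others on its indirect identifiers, while the two rows with unique combinations have been suppressed.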
Since k-anonymization helps you transform, analyze, and share data securely at scale, it is a particularly important privacy-enhancing technique (PET) for Databricks users dealing with large sets of sensitive data.
How to Implement Automated Privacy Controls in Databricks
If a column of your data contains sensitive information, you can apply a masking policy to conceal it from users unless they possess a specific attribute, belong to a group, or are acting under a purpose you have defined. Here's how:
1. Navigate to the Data Source Overview tab and click on the Policies tab in the top navigation bar.
2. In the Data Policies menu, click New Policy.
3. Select mask in the first dropdown.
4. Select a custom masking type in the next dropdown menu. There are several options to choose from, including hashing, a regex, randomized response, or a constant, but let's say in this case you want to use k-anonymization.
5. Choose to mask columns using fingerprint, which auto-populates the group size, or requiring a group size of at least, which lets you select the group size manually. Note that using fingerprint requires running Immuta's fingerprint service on the data source first. The fingerprint service queries Databricks to count each possible group of values in the data source and returns the predicates for each column.
6. Select the appropriate columns and parameters in the next dropdown.
7. Choose the condition that will drive the policy: for or when, which allows conditional masking driven by data in the row. For this example, we will use for.
8. From the next dropdown, select the appropriate condition: everyone, everyone except, or everyone who.
9. In the final dropdown, choose the group, purpose, or key attribute/key value pair for your condition.
10. Click Create to finish and save your policy. The policy is enforced natively on read from Databricks and applied to the plan Spark builds for users' queries from the notebook. The final result might look something like the sketch below, where certain combinations of state and gender have been suppressed, but a statistically relevant number of rows can still be analyzed.
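As an illustration, here is what querying the protected table from a Databricks notebook might look like. The table name, column names, and output behavior are hypothetical assumptions for the sake of the example; actual results depend on your data and policy settings.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Query a hypothetical Immuta-protected table; the k-anonymization policy
# is applied to the plan Spark builds, not to the notebook code itself.
result = spark.sql("""
    SELECT state, gender, COUNT(*) AS respondents
    FROM analytics.survey_responses
    GROUP BY state, gender
    ORDER BY respondents DESC
""")
result.show()
# Rare state/gender combinations come back as NULL (suppressed by the
# policy), while common combinations remain available for analysis.
```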
Striking the Balance Between Security and Utility
Automated privacy-enhancing capabilities, like the ones Immuta natively offers Databricks users, are critical for reducing re-identification risk, balancing data utility with privacy, and accelerating speed to data access.
To learn more, check out our eBook, A Guide to Data Access Governance with Immuta & Databricks, which specifies exactly how you can use Immuta to unlock more value from your Databricks data.