
How to Anonymize Employee Data Using Databricks Spark

More organizations with large workforces are running analytics projects on their employee data using cloud data platforms such as Databricks or Snowflake. This article walks data analytics and engineering teams through using Immuta for Databricks to dynamically enforce k-anonymization on a data set for data scientists or analysts, with the automation and auditing that a compliance team, data protection officer (DPO) or auditor needs.

K-anonymization is a privacy-preserving model that anonymizes indirect identifiers to prevent re-identification of individuals through threats such as linkage attacks. A data set is k-anonymous when every record is indistinguishable from at least k-1 other records on its indirect identifiers. This method can be used in conjunction with other de-identification techniques to further reduce exposure to regulatory fines and, depending on your use case, may help take a data set out of the scope of regulations such as HIPAA or GDPR.
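As a rough illustration of what k-anonymity means, here is a minimal sketch in plain Python. It is not part of Immuta or Databricks; the function names and the sample rows are hypothetical. It groups records by their indirect identifiers and reports the smallest group size, which is the largest k the data set satisfies.

```python
from collections import Counter


def min_group_size(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values for every quasi-identifier column (0 for an empty data set)."""
    counts = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return min(counts.values(), default=0)


def is_k_anonymous(rows, quasi_identifiers, k):
    """A data set is k-anonymous if every group has at least k rows."""
    return min_group_size(rows, quasi_identifiers) >= k


# Hypothetical sample rows; only the indirect identifiers matter here.
rows = [
    {"state": "HI", "gender": "F"},
    {"state": "HI", "gender": "F"},
    {"state": "HI", "gender": "M"},
    {"state": "HI", "gender": "M"},
    {"state": "DE", "gender": "M"},
]

print(min_group_size(rows, ["state", "gender"]))     # 1
print(is_k_anonymous(rows, ["state", "gender"], 2))  # False
```

The single Delaware record is the only one with its state and gender combination, so it could be singled out by anyone who knows those two facts about the employee.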

Sample Human Resources Data Set

This is mock data stored in Databricks in a table named “hr_records” in the default database. It contains highly sensitive columns, including direct identifiers (such as social security number and email), indirect identifiers (such as location, start date, tenure and gender) and sensitive data (such as salary and weight), that can identify an individual employee. Because of these privacy concerns, the table may not be suitable to present as-is to data scientists and analysts.

Register the Databricks Table with Immuta

After configuring the Immuta artifacts in Databricks, open the Immuta console, click the data sources icon on the left and click + New Data Source to create a new Databricks connection, then select the table “default.hr_records”. No data is ever stored in Immuta since this is a logical table. The fields can be tagged by running Immuta’s built-in sensitive data detection, by importing tags from an existing data catalog, or manually. Below are example tags (in purple) for some of the identifiers in the table.

Immuta Product Screen Shot, viewing the Data Dictionary tab

Create Dynamic K-anonymization Policy Without Code

Click on policies from the “default_hr_records” data source and create a new data policy using the no-code policy builder. This example enforces k-anonymity using the masking (suppression) method on the [state] and [gender] columns tagged in the previous step. This is a local policy, which applies only to this specific data source, in contrast to global policies, which apply across all data sources based on logical metadata (the tags).

Immuta Product Screen Shot, viewing the policy builder screen

Save Policy to be Actively Enforced

At this point, a fingerprint has been run on the Databricks table to calculate the statistics required for k-anonymization. Users can either use the minimum group size, K, given by the fingerprint or manually specify the value of K. Upon policy creation, the fingerprint service runs a query against Databricks to get the counts for each possible group of values in the data source, then returns the custom predicates for each column. To protect identities, the predicates contain only a whitelist of the values users are allowed to see.

We can now see the data policy applied to the Databricks table, default.hr_records, that we registered in Immuta. Your compliance team or DPO is able to understand the policy in plain English.
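The counting step described above can be pictured with a minimal sketch in plain Python. This is not Immuta's fingerprint service or its predicate format; the function names, the sample rows and the choice to whitelist whole [state] and [gender] combinations (rather than build per-column predicates) are assumptions made for illustration.

```python
from collections import Counter


def group_counts(rows, columns):
    """Count the rows for each combination of values in the given columns."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def allowed_groups(rows, columns, k):
    """Return the combinations that occur at least k times.

    Only these combinations would be safe to show unmasked."""
    return {group for group, n in group_counts(rows, columns).items() if n >= k}


def make_rows(state, gender, n):
    """Build n hypothetical rows for one state and gender combination."""
    return [{"state": state, "gender": gender} for _ in range(n)]


# Hypothetical data: two large groups in Hawaii, two small ones in Delaware.
rows = (
    make_rows("HI", "F", 5)
    + make_rows("HI", "M", 6)
    + make_rows("DE", "M", 1)
    + make_rows("DE", "F", 4)
)

whitelist = allowed_groups(rows, ["state", "gender"], k=5)
print(sorted(whitelist))  # [('HI', 'F'), ('HI', 'M')]
```

With K set to 5, both Delaware combinations fall below the threshold, so neither appears in the whitelist.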

Immuta Product Screen Shot, viewing a subscription policy

Query the Human Resources Data Set with K-anonymity Enforced on Read

The default_hr_records data source is exposed as a table in Databricks under the ‘immuta’ database, and analysts or data scientists can then query the table. Policies are enforced natively on read from Databricks: the underlying data is not modified or copied, and the policies are applied to the plan that Spark builds for a user’s query from the notebook.

While different field groups get obfuscated per the dynamic k-anonymization policy, utility in the data is preserved. For example, [state] and [gender] were suppressed to null for the state+gender combinations where there was a risk of re-identification. However, a statistically relevant number of [state] and [gender] combinations can still be analyzed (for example, to determine average salaries). In this overly simplistic scenario, the K value is 5 and the average salary in Hawaii is 112,949.40. Note the null group of 5 records, where the masking policy is applied to employees who share their [gender] and [state] combination with fewer than k-1 other employees. This is because Delaware had only 1 male and 4 female employees in the raw data set, and both groups are smaller than K.
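The Delaware case can be reproduced with a minimal sketch in plain Python. It only mimics the effect of suppression on a small in-memory list; it is not how Immuta rewrites the Spark plan. The function names are hypothetical, and the salaries are made-up figures (in thousands) that do not come from the hr_records table.

```python
from collections import Counter, defaultdict

K = 5
QUASI_IDENTIFIERS = ("state", "gender")


def make_rows(state, gender, salaries):
    """Build hypothetical employee rows for one state and gender combination."""
    return [{"state": state, "gender": gender, "salary": s} for s in salaries]


def suppress(rows, columns, k):
    """Return copies of the rows with the given columns set to None
    wherever the combination of their values occurs fewer than k times."""
    counts = Counter(tuple(r[c] for c in columns) for r in rows)
    masked = []
    for r in rows:
        key = tuple(r[c] for c in columns)
        if counts[key] >= k:
            masked.append(dict(r))
        else:
            masked.append({**r, **{c: None for c in columns}})
    return masked


def average_salary_by(rows, column):
    """Average the salary column for each distinct value of `column`."""
    totals = defaultdict(lambda: [0, 0])  # value -> [salary sum, row count]
    for r in rows:
        totals[r[column]][0] += r["salary"]
        totals[r[column]][1] += 1
    return {value: total / n for value, (total, n) in totals.items()}


# Made-up salaries in thousands; Delaware has 1 male and 4 female employees.
rows = (
    make_rows("HI", "F", [100, 110, 120, 130, 140])
    + make_rows("HI", "M", [90, 100, 110, 120, 130])
    + make_rows("DE", "M", [150])
    + make_rows("DE", "F", [80, 90, 100, 110])
)

masked = suppress(rows, QUASI_IDENTIFIERS, K)
print(average_salary_by(masked, "state"))  # {'HI': 115.0, None: 106.0}
```

The five Delaware rows end up in a single null group, while the Hawaii rows keep their values and can still be aggregated.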

Databricks Product Screen Shot

Enhanced Audit Logging for Anonymization in Databricks

Within Immuta, click on Audit to see a report of all of the steps and details. The Spark plan for each query, user and purpose/intent are available to rapidly demonstrate k-anonymization across [state] and [gender].

Immuta Product Screen Shot, viewing the Governance screen

Additional Resources

Immuta is a Databricks partner. This solution was developed by our expert teams of software engineers, legal engineers (legal nerds) and statisticians (math geeks). In addition to dynamic k-anonymization, Immuta automates fine-grained access controls and includes a suite of additional privacy techniques for Databricks.

There are open source (OSS) solutions available for a static k-anonymization implementation with Apache Spark. The example in this article is a simple one, covering 1 table, 1 rule and 1 role, without a specific regulation in mind. Our commercial approach using a dynamic implementation (no data copies or ETL) is best suited for data teams working in regulated industries and/or with sensitive data in complex environments, where automation and prebuilt auditing are required to manage data sets, rules and access at scale.

If you find value in this approach, you can:

Request a Demo

Or if you are interested in learning more about privacy preserving techniques for your data scientists, you can watch this webinar or download an ebook.