How to Anonymize Data with Databricks Access Control

Organizations with large workforces are increasingly analyzing employee data using cloud data platforms such as Databricks in order to optimize performance, engagement, and results. This trend necessitates a new approach to Databricks access control.

In this article, we’ll walk data engineering and operations teams through how to dynamically enforce k-anonymization on a data set for data scientists or analysts using Immuta’s automated approach to data access control for Databricks, including how to audit data use for a compliance team, DPO, or auditor.

[Tip] Explore the evolution of data access control in RBAC vs. ABAC: Future-Proofing Access Control.

K-anonymity is a privacy-preserving data model, applied here as a form of dynamic data masking, that anonymizes indirect identifiers so that each individual cannot be distinguished from at least k-1 other individuals in the data set. This helps prevent re-identification through threats such as linkage attacks. The method can be used in conjunction with other data de-identification techniques to further reduce the risk of regulatory fines by making a data set exempt from regulations such as HIPAA or GDPR, depending on your use case.
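To make the definition concrete, here is a minimal sketch of a k-anonymity check in plain Python. It is not Immuta's implementation; the function name `is_k_anonymous` and the sample rows are hypothetical and exist only for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k rows."""
    groups = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in rows
    )
    return all(count >= k for count in groups.values())


# Hypothetical rows, for illustration only.
rows = [
    {"state": "HI", "gender": "F", "salary": 100},
    {"state": "HI", "gender": "F", "salary": 120},
    {"state": "DE", "gender": "M", "salary": 90},
]

print(is_k_anonymous(rows, ["state", "gender"], k=2))  # False: the DE/M row is unique
print(is_k_anonymous(rows, ["state", "gender"], k=1))  # True
```

A data set that fails this check for the chosen k is one where some individuals could be singled out by their indirect identifiers alone.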

Databricks Access Control with a Sample Human Resources Data Set

This is mock data stored in Databricks in a table named “hr_records” in the default database. It contains highly sensitive columns, including direct identifiers (such as Social Security number and email), indirect identifiers (such as location, start date, tenure, and gender), and sensitive data (such as salary and weight) that can identify an individual employee. In its current state, the table may not be suitable to present to data scientists and analysts due to privacy concerns.

Register the Databricks Table with Immuta

After configuring the Immuta artifacts in Databricks, open the Immuta console, navigate to the data sources icon on the left, and click + New Data Source to create a new Databricks connection. Next, select the table “default.hr_records”. Because this is a logical table, no data is ever stored in Immuta. The fields can be tagged by running Immuta’s built-in sensitive data discovery, importing tags from existing data catalogs, or tagging data manually. In this example, some of the identifiers in the table, including [state] and [gender], are tagged.

Create Dynamic K-anonymization Policy Without Code

There are two primary types of policies you can create to enforce Databricks access control: global policies, which apply across all data sources based on logical metadata (the tags), and local policies, which apply to specific data sources. In this example, we’ll enforce a local policy.

From the “default_hr_records” data source, click Policies and create a new data policy using Immuta’s Policy-as-Code policy builder. This example enforces k-anonymity using the masking (suppression) method on the [state] and [gender] columns tagged in the previous step.

Save Policy to be Actively Enforced

At this point, a fingerprint has been run on the Databricks table to calculate the statistics required for k-anonymization. Users can either use a minimum group size, k, given by the fingerprint, or manually specify the value of k.

Upon policy creation, the fingerprint service runs a query against Databricks to get the counts for each possible group of values in the data source. The fingerprint service then returns the custom predicates for each column. To protect identities, the predicates contain only an allowlist of the values users are permitted to see. We can now see the data policy applied to the Databricks table, default.hr_records, that we registered in Immuta. The Policy-as-Code approach means your compliance team or DPO is able to understand the policy in plain English.
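The counting step described above can be sketched in plain Python. This is only an illustration of the general idea (count each combination of values, then keep the combinations that meet k); it does not reproduce the fingerprint service's actual query or predicate format, and the names `fingerprint_counts`, `allowlist_for`, and the sample rows are hypothetical.

```python
from collections import Counter


def fingerprint_counts(rows, columns):
    """Count how many rows share each combination of values in `columns`."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def allowlist_for(counts, k):
    """Keep only the combinations that are shared by at least k rows."""
    return {combo for combo, n in counts.items() if n >= k}


# Hypothetical rows, for illustration only.
rows = (
    [{"state": "HI", "gender": "F"}] * 3
    + [{"state": "HI", "gender": "M"}] * 2
    + [{"state": "DE", "gender": "M"}]
)

counts = fingerprint_counts(rows, ("state", "gender"))
allowed = allowlist_for(counts, k=2)

print(counts[("HI", "F")], counts[("HI", "M")], counts[("DE", "M")])  # 3 2 1
print(sorted(allowed))  # [('HI', 'F'), ('HI', 'M')]
```

Because the resulting predicate lists only the values that may be shown, the rare combinations that must be hidden never appear in the policy itself.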

Query the Human Resources Data Set with K-anonymity Enforced on Read

The default_hr_records data source is exposed as a table in Databricks under the ‘immuta’ database cluster, and analysts or data scientists are now able to query the table. This is all enforced natively on read from Databricks, meaning that the underlying data is not being modified or copied, and the data access control policies are applied to the plan that Spark builds for a user’s query from the Notebook.

While different field groups get obfuscated per the dynamic k-anonymization policy, utility in the data is preserved. As an example, [state] and [gender] were suppressed to null for the state+gender combinations where there was a risk of re-identification. However, a statistically relevant number of [state] and [gender] combinations can still be analyzed (for example, to determine average salaries).

In this overly simplistic scenario, the k value is 5 and the average salary in Hawaii is $112,949.40. Note the null group of 5 records, where the masking policy is applied to employees who cannot be grouped with at least k-1 other employees sharing the same [gender] and [state]. This is because Delaware had only 1 male and 4 female employees in the raw data set, and both groups are smaller than k.
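The suppression behavior in this scenario can be reproduced with a small, self-contained Python sketch. Only k = 5 and the Delaware group sizes (1 male, 4 female) come from the example above; the salaries, the Hawaii group sizes, and the helper name `mask` are made up for illustration, and this is not how Immuta applies the policy inside the Spark plan.

```python
from collections import Counter, defaultdict

K = 5
COLUMNS = ("state", "gender")

# Hypothetical rows: salaries and Hawaii group sizes are invented.
rows = (
    [{"state": "HI", "gender": "F", "salary": 100_000 + 1_000 * i} for i in range(5)]
    + [{"state": "HI", "gender": "M", "salary": 110_000 + 1_000 * i} for i in range(5)]
    + [{"state": "DE", "gender": "M", "salary": 95_000}]
    + [{"state": "DE", "gender": "F", "salary": 90_000 + 1_000 * i} for i in range(4)]
)

# Size of each state+gender group in the raw data.
counts = Counter(tuple(r[c] for c in COLUMNS) for r in rows)


def mask(row):
    """Null out state and gender when the row's group is smaller than K."""
    combo = tuple(row[c] for c in COLUMNS)
    if counts[combo] >= K:
        return row
    return {**row, **{c: None for c in COLUMNS}}


masked = [mask(r) for r in rows]

# Average salary per (state, gender) group after masking.
salaries = defaultdict(list)
for r in masked:
    salaries[(r["state"], r["gender"])].append(r["salary"])
averages = {group: sum(v) / len(v) for group, v in salaries.items()}

for group, avg in averages.items():
    print(group, len(salaries[group]), avg)
# ('HI', 'F') 5 102000.0
# ('HI', 'M') 5 112000.0
# (None, None) 5 92200.0
```

The two Delaware groups collapse into a single null group of 5 records, while the Hawaii groups remain available for analysis such as average salary.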

Enhanced Audit Logging for Anonymization in Databricks

Within Immuta, click Audit to see a report of all of the steps and details. The Spark plan, user, and purpose/intent for each query are available to rapidly demonstrate k-anonymization across [state] and [gender].

Additional Databricks Access Control Resources

This Databricks access control solution was developed by our expert teams of software engineers, legal engineers (legal nerds), and statisticians (math geeks). In addition to dynamic k-anonymization, Immuta automates fine-grained access controls and includes a suite of additional privacy and dynamic data masking techniques for Databricks.

There are OSS solutions available for a static k-anonymization implementation with Apache Spark. The walkthrough above is a simple example for one table, one rule, and one role, without a specific regulation in mind. Our commercial approach, which uses a dynamic implementation (no data copies or ETL), is best suited for data teams working in regulated industries and/or with sensitive data in complex environments, where automation and prebuilt auditing are required to manage data sets, rules, and access at scale.
