
Mask and Protect Data Across Databricks and Snowflake

Organizations are increasingly deploying Immuta to govern access across multiple cloud services. For example, some of our customers use Databricks as their primary platform for ETL and data science, and Snowflake as their primary platform for BI. This article walks through how Immuta works with multiple cloud services to provide centralized data access governance, cross-cloud data classification, and consistent data masking.

The same concepts apply to other cloud services such as Starburst Presto, Amazon Redshift, and Azure Synapse, as well as to relational databases hosted in AWS, Azure, or GCP.

For the article steps, below are example rules to implement across cloud services:

  • Mask all PII data across the cloud data ecosystem, which can be hundreds or thousands of tables.
  • Make an exception for the HR department to see PII.

Sensitive Data Stored in Databricks

If we query the data from a Python notebook in Databricks, we can see some of the PII data, which is a mix of indirect identifiers such as gender and state, and direct identifiers such as name and email address.

Sensitive Data Stored in Snowflake

If we query the data from a Snowflake worksheet, we can see some of the PII data with indirect identifiers such as COUNTY_ID and MIDDLE_NAME and direct identifiers such as RESIDENTIAL_ADDRESS.

Automate Sensitive Data Discovery Across Databricks and Snowflake

Immuta provides sensitive data discovery capabilities to automate the detection and classification of sensitive attributes across Databricks and Snowflake, and across your cloud data ecosystem. After registering data sources with Immuta, the catalog will standardize classification and tagging of direct, indirect and sensitive identifiers consistently. This enables you to create policies in a dynamic and scalable way across data platforms.
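To make the idea concrete, column classification can be sketched as pattern matching over sampled values. This is a simplified illustration, not Immuta's actual detection logic; the `PATTERNS` dictionary, tag names, and match threshold below are all hypothetical.

```python
import re

# Hypothetical patterns for a couple of direct identifiers.
PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def classify_column(sample_values, threshold=0.8):
    """Return tags whose pattern matches at least `threshold` of the samples."""
    tags = []
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in sample_values if pattern.match(v))
        if sample_values and hits / len(sample_values) >= threshold:
            tags.append(tag)
    return tags
```

Once every column carries a consistent tag such as EMAIL, policies can target the tag instead of individual column names, which is what makes the approach scale across platforms.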

Consistent Access Control Across Databricks and Snowflake

Using Immuta’s explainable policy builder, you can create a global masking policy to apply data masking across all fields in Databricks and Snowflake. This includes hashing, regular expression, rounding, conditional masking, replacing with null or constant, with reversibility, with format preserving masking and with k-anonymization, as well as external masking.

Note that the policy applies to “everyone except” those possessing an attribute where “Department” is “Human Resources,” which is pulled from an external system. This dynamic approach is also known as attribute-based access control, and it can reduce roles by 100x, making data more manageable and reducing risk for data engineers and architects.
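The "everyone except" decision can be sketched in a few lines of Python. The attribute name and helper functions below are illustrative only; Immuta evaluates these conditions inside its policy engine, not in user code.

```python
def hr_exception_applies(user_attributes):
    # The policy masks for "everyone except" users whose Department
    # attribute (pulled from an external system) is Human Resources.
    return user_attributes.get("Department") == "Human Resources"

def resolve_column(raw_value, masked_value, user_attributes):
    # HR sees the raw value; everyone else sees the masked value.
    return raw_value if hr_exception_applies(user_attributes) else masked_value
```

Because the decision keys off an attribute rather than a role, adding a new HR user requires no policy change and no new role, which is where the reduction in role count comes from.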

In addition to column controls, Immuta supports row-level filtering and dynamic privacy-enhancing technologies (PETs) such as differential privacy or randomized response.
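As a minimal sketch of one such PET, classic randomized response answers truthfully only with some probability, so no individual answer can be taken at face value, yet the aggregate rate can still be estimated without bias. The function names and the 0.75 truth probability below are illustrative.

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75, rng=None) -> bool:
    """Answer truthfully with probability p_truth; otherwise answer
    uniformly at random, giving each respondent plausible deniability."""
    rng = rng or random.Random()
    if rng.random() < p_truth:
        return true_answer
    return rng.random() < 0.5

def estimate_true_rate(responses, p_truth=0.75):
    """Unbiased estimate of the underlying rate of True answers.
    E[observed] = p_truth * rate + (1 - p_truth) * 0.5, so invert."""
    observed = sum(responses) / len(responses)
    return (observed - (1 - p_truth) * 0.5) / p_truth
```

The same trade-off — per-record noise, aggregate accuracy — is what differential privacy formalizes with rigorous guarantees.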

Native Data Masking Policy Enforced in Databricks

If we run the notebook in Databricks again, we see that all columns tagged as PII by sensitive data discovery are dynamically masked, without making copies or creating and managing views. The policy is enforced natively on Spark jobs in Databricks, so the underlying data is never copied or modified.

Native Data Masking Policy Enforced in Snowflake

If we rerun the query from the Snowflake worksheet, we see that all columns tagged as PII by sensitive data discovery are dynamically masked, without making copies or creating and managing views. The policy is enforced natively in Snowflake as a secure view managed by Immuta, so the underlying data is never copied or modified.

DIY Approaches for Databricks and Snowflake

Without Immuta, each cloud service requires its own approach to protecting sensitive data.

For Databricks:

You might write ETL code in Python, using Spark UDFs to mask specific columns and replicate the data for all users except the HR department, who access the raw data.

(Example from Databricks Engineering Blog)
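A minimal sketch of that DIY approach might look like the following; the salted-hash mask, the inline salt, and the column name are hypothetical stand-ins.

```python
import hashlib

def mask_pii(value):
    """Irreversibly mask a PII value with a salted SHA-256 hash.
    The salt is shown inline for brevity; in practice it would
    come from a secret store, not source code."""
    if value is None:
        return None
    salt = "example-salt"  # hypothetical salt
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# In a Databricks notebook this would be registered as a Spark UDF
# and applied column by column, e.g.:
#   from pyspark.sql.functions import udf, col
#   mask_udf = udf(mask_pii)
#   df = df.withColumn("email", mask_udf(col("email")))
```

Every protected column in every table needs this transformation applied, and the masked copy maintained alongside the raw data for HR — exactly the per-table work a centralized policy removes.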

For Snowflake:

You can start by creating three roles – one for HR, one for non-HR and one to own secure views. 

Create a masking policy:

CREATE OR REPLACE MASKING POLICY voter_mask AS
(val string) RETURNS string ->
CASE
    WHEN invoker_role() = 'MY_PRIVACY_ADMIN' THEN val
    ELSE NULL
END;

Apply the policy on all of the PII columns. Below is an example for one of them:

ALTER TABLE PUBLIC.VOTER_TBL MODIFY COLUMN MIDDLE_NAME SET MASKING POLICY voter_mask;

Then create a secure view.

Here is a good write-up from the chief BI architect at EA (Electronic Arts).

Governing access to sensitive data in the cloud becomes challenging as data volumes, user counts, and cloud services multiply. The DIY example above is specific to a single table and requires very different approaches in Databricks and Snowflake. Immuta automates these steps in a consistent, secure way across your cloud data ecosystem. You can explore Immuta further in a free trial, or request a demo.

Start a Free Trial