Secure Access to Databricks Without Compromising Performance & Scalability

STEVE TOUW

Published July 26, 2021

Last edited: April 3, 2025

You need scalability and performance

Scalability and performance go hand-in-hand. If you have great performance, but not a scalable solution, you haven’t solved much. The inverse is also true. This is why the cloud is so successful, and tools like Databricks work so well – they can scale on the cloud by spinning up performance-appropriate compute to crunch data with Spark on demand: scalability and performance.

I already spent some time explaining how Immuta’s attribute-based access control (ABAC) model provides a highly scalable solution — 75X more scalable than role-based (RBAC) solutions like Ranger, as proven by GigaOm’s independent analysis. But that’s only half the picture; without performance, scalability is meaningless.

We’re continuously enhancing the Immuta platform to optimize scalability and performance. In this article, we’ll show you:

How Immuta enforcement on Databricks works transparently
Real results from repeatable TPC-DS performance benchmarks for Immuta-protected Databricks clusters with minimal overhead
A new Immuta innovation that can, under certain scenarios, increase performance by 89%

First, let’s start with a refresher on Immuta’s native integration with Databricks.

Databricks Enforcement

Immuta is not a virtualization solution. We integrate directly with our cloud data platform partners at the data level, providing a non-invasive and performant enforcement point. This blog details our performance results and some new performance enhancements on Databricks that we are extremely proud of; our performance coupled with our scalability provides a modern cloud data access control solution.

First, a quick tangent on how Immuta works in Databricks. Immuta is deployed through a lightweight plugin into the Databricks cluster that leverages open source hooks in the Spark planner, which allows us to rewrite queries to enforce complex policy. Remember, Immuta applies policy using a dynamic and scalable ABAC model coupled with this native enforcement architecture, providing the best of both worlds: scalability and performance.

This is important for Immuta customers that deliver cloud data innovation on Databricks, such as AstrumU, S&P Global, Janus Henderson, and others.
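
To picture what "rewriting a query to enforce policy" means, here is a toy sketch. It is not Immuta's implementation: the real plugin hooks into the Spark planner rather than editing SQL strings, and the function names, the salt, and the way the salt is concatenated are all assumptions made for illustration. The only real pieces are the Spark SQL built-ins `sha2` and `concat`.

```python
# Toy illustration only. Immuta's plugin works inside the Spark planner,
# not on SQL text; every name below is hypothetical.

def mask_expr(column: str, salt: str) -> str:
    """Return a Spark SQL expression that hashes a column with a salt."""
    # sha2(expr, 256) and concat(...) are built-in Spark SQL functions.
    return f"sha2(concat({column}, '{salt}'), 256) AS {column}"


def rewrite_select(table: str, columns: list[str], masked: set[str], salt: str) -> str:
    """Build a SELECT in which every masked column is wrapped in mask_expr."""
    projected = [mask_expr(c, salt) if c in masked else c for c in columns]
    return f"SELECT {', '.join(projected)} FROM {table}"


query = rewrite_select(
    table="synthetic_crime_data",
    columns=["address", "area_id", "crime_code"],
    masked={"address"},          # assumed policy: only address is masked
    salt="example-salt",         # hypothetical salt
)
print(query)
```

The user still writes a plain query against the table; the policy decides which columns come back masked.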

Databricks Performance

We used the same TPC-DS tests¹ that Databricks referenced in its performance blog post, “Adaptive Query Execution: Speeding Up Spark SQL at Runtime,” and masked 107 STRING columns across 14 TPC-DS tables with salted SHA-256 hashing. Against the 100GB scale factor of TPC-DS, performance varied very little. In fact, for most queries the latency was slightly over a second.

Other details:

DBR version: 8.1
Machine type: i3.2xlarge
Autoscaling off
Number of workers: 4
Queries run twice each (used average)

You can download the raw results data here.

The key takeaway is that applying intensive masking, such as hashing, on many columns (107 of them) adds very little overhead in the TPC-DS benchmarks (more details in the next section).
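
For readers less familiar with the masking technique used in the benchmark, here is a minimal sketch of salted SHA-256 hashing in plain Python. The salt value and the way it is combined with the input (simple prefixing) are assumptions for illustration, not a description of how Immuta salts values.

```python
import hashlib

SALT = "example-salt"  # hypothetical; a real salt would be a managed secret


def salted_sha256(value: str, salt: str = SALT) -> str:
    """Mask a string by hashing it together with a salt (assumed: salt is prefixed)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


first = salted_sha256("100 Main St")
again = salted_sha256("100 Main St")
other = salted_sha256("200 Oak Ave")

# The same input always yields the same digest, so equal values stay equal
# after masking; different inputs yield different digests.
print(first == again, first == other)
```

Because the hash is deterministic for a given salt, two masked columns can still be compared or joined on equality, while the original values are not readable from the output.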

Ok, so how have we made our Databricks integration even more performant?

Smart Mask Ordering

If you can micromanage when value masking occurs, you can get significant performance gains. For example, naïvely creating a view that hashes a column will require all the values in that column to be hashed before other operations can be performed (such as joining). This can add a significant amount of overhead, especially on petabytes of data.

Immuta has implemented complex logic we term “smart mask ordering” that considers the types of queries and the masking techniques to avoid masking until the “last possible second”. This results in negligible overhead for column masking under many scenarios.

TPC-DS queries do all their joins on INT columns and do very little scanning of STRING columns. We masked with hashing because it retains referential integrity on joins and is an intensive operation, but it is limited to STRING types. This is why we saw good performance numbers on the TPC-DS benchmark, even with so many columns masked and no smart mask ordering: very few scans were performed on the masked columns. With this in mind, to run a more relevant test, we took a hand-crafted query that joins across 4 salted SHA-256 hashed columns².

The results are staggering.

Query: select * from synthetic_crime_data a join synthetic_crime_data b on a.address = b.address and a.area_id = b.area_id and a.area_name = b.area_name and a.crime_code = b.crime_code limit 5;

Masked columns: address, area_id, area_name, crime_code masked by sha256 hashing with salt

Chart: https://www.immuta.com/wp-content/uploads/2021/07/image1-2-1.png

The key takeaway from this chart is that if you were to create a view that masks columns (not using Immuta), you would, depending on the masking algorithm used, see results very similar to the “Immuta w/o smart mask order” column. With the view approach, Spark must compute hashed values for each of the four columns on one side of the join (this is a self join, so it will leverage ReusedExchanges as an optimization). The number of hash operations = the number of rows x the number of columns masked, which means a total of 4 billion hash operations for the 1B-row data set. With smart mask ordering, this is reduced to a total of 5 (yes, five) hash operations, since the masking happens last, after the limit of 5 is applied.
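
The difference between masking first and masking last can be sketched with a small, self-contained simulation. This is a toy model in plain Python, not Spark and not Immuta's planner logic: the synthetic rows, the salt, and the function names are all assumptions. It only counts how many hash calls each ordering makes. In this toy, the late-masking count scales with the LIMIT rather than the table size; it is not meant to reproduce the exact figures quoted above.

```python
import hashlib
from itertools import islice

MASKED = ("address", "area_id", "area_name", "crime_code")
SALT = "example-salt"  # hypothetical


class CountingHasher:
    """Salted SHA-256 that records how many times it was called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, value):
        self.calls += 1
        return hashlib.sha256((SALT + str(value)).encode("utf-8")).hexdigest()


def make_rows(n):
    """Synthetic stand-in for the table; addresses are unique per row."""
    return [
        {"address": f"{i} Main St", "area_id": str(i % 7),
         "area_name": f"Area {i % 7}", "crime_code": str(i % 13)}
        for i in range(n)
    ]


def self_join(rows):
    """Lazily yield (left, right) pairs that match on all four columns."""
    index = {}
    for row in rows:
        index.setdefault(tuple(row[c] for c in MASKED), []).append(row)
    for left in rows:
        for right in index[tuple(left[c] for c in MASKED)]:
            yield left, right


def mask(row, hasher):
    return {c: hasher(row[c]) for c in MASKED}


def mask_then_join(rows, limit, hasher):
    """View-style approach: hash every masked cell, then join, then limit."""
    masked_rows = [mask(r, hasher) for r in rows]
    return list(islice(self_join(masked_rows), limit))


def join_then_mask(rows, limit, hasher):
    """Late masking: join on raw values, apply the limit, hash only the output."""
    pairs = islice(self_join(rows), limit)
    return [(mask(l, hasher), mask(r, hasher)) for l, r in pairs]


rows = make_rows(1000)
eager, lazy = CountingHasher(), CountingHasher()
eager_result = mask_then_join(rows, 5, eager)
lazy_result = join_then_mask(rows, 5, lazy)
print(eager.calls, lazy.calls, eager_result == lazy_result)
```

Both orderings return the same masked rows, because a deterministic hash preserves equality on the join keys. The eager version performs rows x masked columns hash calls, while the late version only hashes the cells that survive the limit.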

Try it for yourself!

To see the scalability and performance of Immuta in your own Databricks environment, request a 14-day free trial of our SaaS offering.

The Immuta engineering team believes in transparency and reproducibility in any product claims so every organization can verify them in their own environment. Because of this, we released a new performance test notebook that makes it easier for prospective or existing customers to run this same TPC-DS test suite themselves and see the results. You can even combine this performance test with the scalability tests described in the GigaOm article, which also uses these same TPC-DS tables.

1. TPC-DS data has been used extensively by Database and Big Data companies for testing performance, scalability and SQL compatibility across a range of Data Warehouse queries — from fast, interactive reports to complex analytics. It reflects a multi-dimensional data model of a retail enterprise selling through 3 channels (stores, web, and catalogs), while the data is sliced across 17 dimensions including Customer, Store, Time, Item, etc. The bulk of the data is contained in the large fact tables: Store Sales, Catalog Sales, Web Sales — representing daily transactions spanning 5 years.

2. You might want to mask a column in a way that does not allow users to join on it, but also masks consistently so the data in that column can be grouped/tracked (such as a hash). This can be managed in Immuta – you can control if a masked column retains referential integrity or not down to the use case the analyst is tackling.
