We heard from several of our customers that managing large shared metastores can be challenging, primarily because users have access to all data stored in a cluster's managed tables. Table ACLs address part of this challenge by protecting tables for SQL and Python, but only after table access control has been enabled for the cluster.
But with R & Scala notebooks, it becomes difficult to manage operations that modify tables, whether DDL statements such as CREATE, ALTER, and DROP or DML statements such as UPDATE and DELETE, within a cluster where table access controls are either not supported or not enabled.
How to manage shared Metastores using Immuta across all Databricks clusters
In the Summer ‘20 Release of Immuta for Databricks, we introduced fine-grained access controls for R & Scala. While Immuta provides automated security and privacy controls for Databricks environments where all rules are dynamically enforced on Spark jobs, the focus of this article is on table access controls to manage DDL operations.
Our customers have raised the need to protect tables from DDL operations issued by different users, such as data scientists working in R or data engineers working in Scala, sharing a standard Databricks cluster. To illustrate: I created a table "sumit.boxes" last week but later accidentally deleted "default.boxes," which our CTO had created for a demo in the same cluster without Immuta enabled. OK is not actually ok. [expletive]
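The mishap comes down to one statement run against the wrong database. On a cluster without table access controls and without Immuta, nothing in a Scala notebook prevents it; a sketch, using the table and database names from my example:

```scala
// In a Databricks Scala notebook on an unprotected cluster.
// Dropping my own table is fine:
spark.sql("DROP TABLE IF EXISTS sumit.boxes")

// But one slip in the database name and someone else's table is gone:
spark.sql("DROP TABLE IF EXISTS default.boxes")  // the CTO's demo table
```

Both statements succeed silently, which is exactly the problem the rest of this article addresses.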
While our customers are definitely smarter than me, it’s important for them to deliver a data platform experience in which rules are enforceable and auditable, especially in sensitive data environments.
How to control create/drop/alter operations in Databricks using Immuta
Immuta’s fine-grained access controls for R & Scala notebooks work on any cluster, building upon existing support for SQL & Python, without requiring table access control to be enabled on the cluster.
Here are the high-level steps to protect against unintended create/drop/alter operations:
- Configure Immuta for your Databricks cluster. The free trial includes an in-product guide that walks you step-by-step through the process of configuring your Databricks cluster and installing the plugin. Or, you can review the installation guide for details and prerequisites.
- Register the table(s) you want to expose to that cluster. This is a virtual reference, so no data is actually moved to Immuta.
With these capabilities configured, when I try to drop a table in the wrong database, I get an error based on Immuta’s access controls. By default, these protections deny all DDL operations that would alter a table or its data.
```
Error in SQL statement: AnalysisException: firstname.lastname@example.org is unable to perform this operation on the database default outside of Immuta workspaces. The user does not have a current project set in Immuta, which is required to access a workspace database;;
DropTableCommand `default`.`boxes`, false, false, false
```
But since I’m working within the Immuta enabled cluster, I am able to create a transformation and fetch data from the new table with Immuta’s access controls using an R notebook (which is pretty cool).
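My transformation was in an R notebook, but the same pattern in a Scala cell looks like the sketch below. The filter column is hypothetical; the point is that reads and temp-view transformations involve no DDL against the shared metastore, while Immuta's policies are still enforced on the underlying Spark job:

```scala
// Query an Immuta-exposed table; row- and column-level policies
// are applied transparently to the Spark job.
val boxes = spark.sql("SELECT * FROM sumit.boxes")

// Register a simple transformation as a temp view. No metastore DDL
// is issued, so no special permissions are required.
boxes.filter($"size" > 10).createOrReplaceTempView("large_boxes")
spark.sql("SELECT COUNT(*) FROM large_boxes").show()
```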
How to enable create/drop/alter permissions
The new secure data collaboration capability in Immuta uses the concept of “Immuta Projects” to manage WRITE operations transparently in Databricks clusters.
Here are the high-level steps to safely permit create/drop/alter operations:
- Create a native Databricks workspace by specifying the storage layer, either on AWS or Azure. I have named mine “mydbws.”
- Create an “Immuta Project,” which provides a safe collaboration space: the data platform admin role can grant access to specific data sets (with data policies applied) and enforce rules for how members of that project collaborate on available data. To do this, click the “Project” icon on the left menu and then “+ New Project.” Specify the purpose for use (in my case, HR analytics) and decide whether you want members to acknowledge their intended use for auditing purposes (in my case, I do).
- From Databricks, all users who are members of the project can now work safely together in the cluster to create and modify tables in the native workspaces managed by Immuta.
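As a sketch, assuming the workspace database takes the “mydbws” name from the workspace created above and using an illustrative table name and query, a project member’s Scala cell might look like:

```scala
// With the "HR analytics" project set as my current project in Immuta,
// DDL inside the project's workspace database succeeds:
spark.sql("""
  CREATE TABLE mydbws.boxes_summary AS
  SELECT COUNT(*) AS box_count FROM sumit.boxes
""")

// Other project members can read and modify the result, and cleanup
// works too. The same DROP against a database outside the workspace
// would be denied, as shown by the error earlier in this article.
spark.sql("DROP TABLE mydbws.boxes_summary")
```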
Get started protecting all of your clusters
Beyond managing DDL operations, Immuta enables teams to manage fine-grained access controls across different users, with additional protection against data leaks when users write data to a cluster where less privileged users could otherwise view it.
If you’re interested in learning more about safe data collaboration in Databricks clusters, as well as automated data-level security and privacy controls, I’d love to walk you through the capabilities that work transparently with Databricks.