The new Databricks Enterprise Cloud Service architecture provides powerful network security capabilities. A lesser-known benefit is that it also enables Data-as-a-Service.
What is Data-as-a-Service?
Data-as-a-Service gives you the ability to share data and provide compute and analytical tools along with it, giving data consumers a full “data experience” in addition to a one-stop shop for data. More specifically, it allows data providers to expose Databricks compute to data consumers alongside consistent, analysis-ready data. Data consumers can also add their own data to the Databricks environment, creating a robust set of inputs to drive new analytical models that provide a competitive edge.
How does Enterprise Cloud Service make this possible?
At its foundation, Enterprise Cloud Service allows the creation of workspaces in a single VPC, across multiple VPCs in a single AWS account, or across multiple AWS accounts – all mapping to the same Databricks account. You can think of this as a dedicated Databricks URL for each data consumer.
This provides many benefits, including:
- Distinct Databricks workspaces for each data consumer
- Multi-Workspace API that allows you to automate the provisioning of a workspace, and other APIs that allow you to bootstrap it according to your needs
- Partitioning DBU cost per workspace, enabling you to charge individual data consumers for their compute
- White-labeling those workspaces, allowing the data provider’s logo to dominate the data consumer experience
- Separation of DBFS and the root bucket, so that items such as cluster metadata are completely separated between workspaces
- Different authentication/identity management mechanisms for each workspace
- Ability to share a single AWS Glue catalog across workspaces
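To make the provisioning step concrete, here is a minimal Python sketch that builds one workspace-creation request per data consumer. It only constructs the requests and sends nothing. The endpoint path and field names reflect our understanding of the Databricks Account API and should be checked against the current documentation; the consumer names, IDs, and the `build_workspace_request` helper are hypothetical and exist purely for illustration.

```python
import json

# Assumed host and path for the Databricks Account (Multi-Workspace) API;
# verify both against the current Databricks documentation before use.
ACCOUNTS_HOST = "https://accounts.cloud.databricks.com"


def build_workspace_request(account_id, consumer, credentials_id):
    """Return (url, payload) for creating one consumer workspace.

    Nothing is sent here; the caller decides how to authenticate and POST.
    """
    url = f"{ACCOUNTS_HOST}/api/2.0/accounts/{account_id}/workspaces"
    payload = {
        "workspace_name": consumer["name"],
        # The deployment name becomes the prefix of the workspace URL.
        "deployment_name": consumer["deployment_name"],
        "aws_region": consumer["aws_region"],
        "credentials_id": credentials_id,
        # A separate storage configuration per consumer keeps each
        # workspace's root bucket (and DBFS) apart from the others.
        "storage_configuration_id": consumer["storage_configuration_id"],
    }
    return url, payload


# Hypothetical consumers; every value below is a placeholder.
consumers = [
    {"name": "Acme Analytics", "deployment_name": "acme",
     "aws_region": "us-east-1", "storage_configuration_id": "storage-acme"},
    {"name": "Globex Research", "deployment_name": "globex",
     "aws_region": "us-west-2", "storage_configuration_id": "storage-globex"},
]

requests_to_send = [
    build_workspace_request("my-account-id", c, "shared-credentials-id")
    for c in consumers
]

for url, payload in requests_to_send:
    print("POST", url)
    print(json.dumps(payload, indent=2))
```

Keeping request construction separate from the HTTP call makes the loop easy to review before anything is provisioned, and the same list of consumers can later drive per-workspace cost reporting.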
Initializing the Enterprise Cloud Service data sharing environment is easy. The data provider can store their data once in S3, create a global AWS Glue catalog defining their tables once, and then spin up white-labeled workspaces for each data consumer. Below is a simplified diagram of that architecture:
However, this is not the complete picture. As a data provider opening your data internally and externally, you must filter out certain tables or rows of data based on what the data consumer has paid for or what different business units should have access to. This may also include masking sensitive columns for untrusted data consumers while maintaining the ability to open those columns for trusted consumers who have signed a Business Associate Agreement (BAA), for example.
This severely complicates the architecture above because it means you would have to maintain individual copies of data as well as different catalog table definitions for every data consumer workspace. This quickly becomes too complex to manage. This is where Databricks partner Immuta completes the Data-as-a-Service architecture.
With Immuta, you are able to reference the single Glue catalog to build table-, row-, and column-level controls that enforce policy dynamically, no matter the Databricks workspace. For example, I could build a rule in Immuta to hide any rows that contain data from outside the U.S. for data consumers that are U.S.-based. This rule is applied dynamically and natively in Databricks/Spark based on the user executing the query and their attributes (such as whether they are U.S.-based), no matter the Databricks workspace or identity management system.
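The idea of attribute-based enforcement can be illustrated with a small, self-contained Python sketch. This is not Immuta's API or implementation; it is a conceptual model in plain Python, and the `apply_policies` function, the `country` and `signed_baa` attributes, and the `ssn` column are all assumptions made for the example. It shows one copy of the data yielding different views depending on who runs the query.

```python
def apply_policies(rows, user, sensitive_columns=("ssn",)):
    """Return the view of `rows` that `user` is allowed to see.

    Conceptual illustration only: a real policy engine rewrites the query
    rather than post-filtering results in Python.
    """
    visible = []
    for row in rows:
        # Row-level rule from the text: U.S.-based consumers
        # do not see rows from outside the U.S.
        if user["country"] == "US" and row["country"] != "US":
            continue
        out = dict(row)  # copy so the single source of data is untouched
        # Column-level rule: mask sensitive columns unless the consumer
        # has signed a BAA (hypothetical `signed_baa` attribute).
        if not user.get("signed_baa", False):
            for col in sensitive_columns:
                if col in out:
                    out[col] = "***"
        visible.append(out)
    return visible


# One shared table, stored once.
table = [
    {"id": 1, "country": "US", "ssn": "123-45-6789"},
    {"id": 2, "country": "DE", "ssn": "987-65-4321"},
]

us_untrusted = {"country": "US", "signed_baa": False}
eu_trusted = {"country": "DE", "signed_baa": True}

print(apply_policies(table, us_untrusted))
print(apply_policies(table, eu_trusted))
```

The policy depends only on the user's attributes, not on which workspace the query came from, which is why a single policy definition can cover every data consumer.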
This is the DaaS architecture secured with Immuta:
Leveraging this architecture allows for massive scalability because you can continue to maintain a single copy of your data and definition of your catalog tables, as well as a single consistent definition of your policies across all data consumers with Immuta. You will easily be able to provide a powerful and secure Data-as-a-Service platform that delights your data consumers.