Would you like to monetize your internal data by sharing it externally with third parties? Do you see the power of joining your data with data from other organizations to drive stronger and better insights?
The desire to share data with third parties is fueled by the realization that:
- The separation of compute from storage on the cloud opens up the opportunity for you to compute on someone else’s live data using the tools and systems you prefer.
- Massive, scalable compute resources allow you to discover new data sets and execute complex questions across multiple disparate data sets from different companies.
However, that desire is halted by the realization that:
- Your data is your IP, and you may not trust other organizations with it, which also keeps you from reciprocating when they are willing to share theirs.
- Data privacy regulations such as GDPR and CCPA require stringent controls on the data you’ve collected related to your data subjects.
- The lack of open standards for data sharing locks you into proprietary systems and data formats, making it difficult and expensive to collaborate on large-scale data sets.
As data sharing and collaborative data use become increasingly common practices — and central to successful analytics and business results — how do you resolve this tension?
A Two-Pronged Attack
Prong 1: Data anonymization
It is certainly important to apply access control within your own walls to protect your sensitive data from your own employees. This is natural for every corporation to do. However, when you begin to think about data access control for third parties outside your company walls, the stakes are magnified. On one hand, you must protect your IP and meet regulatory requirements; on the other hand, you must keep the data useful enough to those third parties. This is the classic privacy vs. utility tradeoff: Add too much privacy and you lose too much utility; add too much utility and you risk too much from a privacy and IP perspective.
The privacy vs. utility tradeoff can be addressed through advanced data anonymization techniques. We are not referring to simply hashing or encrypting the data in a column. Think of these techniques as a dimmer switch rather than a light switch: You can dim the data just enough to protect it while leaving sufficient utility. Techniques such as k-anonymization and local differential privacy allow you to share columns while meeting stringent control requirements.
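To make the dimmer switch idea concrete, here is a minimal sketch of k-anonymization in plain Python. It is an illustration under stated assumptions, not a description of any product's implementation: the column names (`age`, `zip`, `spend`), the sample rows, and the helper functions are all hypothetical. The sketch coarsens the quasi-identifiers (age and ZIP code) and then suppresses any group with fewer than k rows, so a higher k dims the data further.

```python
from collections import Counter


def generalize_age(age, bucket_width):
    """Coarsen an exact age into a range such as '30-39'."""
    low = (age // bucket_width) * bucket_width
    return f"{low}-{low + bucket_width - 1}"


def k_anonymize(rows, k, bucket_width):
    """Generalize quasi-identifiers, then suppress groups smaller than k."""
    generalized = [
        {
            "age": generalize_age(r["age"], bucket_width),
            "zip": r["zip"][:3] + "**",  # keep only the 3-digit ZIP prefix
            "spend": r["spend"],         # the value column is left untouched
        }
        for r in rows
    ]
    # Count how many rows share each (age range, ZIP prefix) combination.
    group_sizes = Counter((r["age"], r["zip"]) for r in generalized)
    # Drop rows whose combination appears fewer than k times.
    return [r for r in generalized if group_sizes[(r["age"], r["zip"])] >= k]


# Hypothetical sample data, for illustration only.
rows = [
    {"age": 31, "zip": "20001", "spend": 120},
    {"age": 34, "zip": "20002", "spend": 80},
    {"age": 38, "zip": "20009", "spend": 45},
    {"age": 52, "zip": "20005", "spend": 300},
    {"age": 57, "zip": "20007", "spend": 210},
    {"age": 71, "zip": "94110", "spend": 60},
]

shared_k2 = k_anonymize(rows, k=2, bucket_width=10)
shared_k3 = k_anonymize(rows, k=3, bucket_width=10)
```

Raising `k` or `bucket_width` turns the dimmer down: fewer and coarser rows survive, which lowers re-identification risk at the cost of utility.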
Prong 2: Separation of compute from storage
Most data sharing is done through a copy-and-paste approach. Organization 1 creates a copy of its sensitive data, anonymizes it in a static way, and shares that version with Organization 2. You can think of this as the Blockbuster approach to data sharing, and it is fraught with concerns: The copies may propagate to places you did not anticipate; a static set is less valuable than live data; and it does not allow dynamic anonymization based on the specificity of queries.
The separation of compute from storage enables a Netflix approach to data sharing. Now you can expose your data sitting in cloud storage to third parties, allowing them to join their live data with your live data using their own compute resources. This avoids having to centralize the data in either Organization 1’s or Organization 2’s storage (Blockbuster), which would be problematic because the data and anonymization would again be static, making it impossible to allow flexible ad hoc queries without losing utility.
Putting It Into Action with Immuta & Databricks Delta Sharing
How does this work in practice?
Immuta provides automated, advanced anonymization techniques, such as k-anonymization and local differential privacy, to tackle the privacy vs. utility tradeoff problem.
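For readers unfamiliar with local differential privacy, the following is a minimal sketch of randomized response, one classic mechanism in that family. It is shown only to illustrate the idea and is not a description of how Immuta implements it; the function names and the simulated data are hypothetical. Each sensitive yes/no value is reported truthfully with probability e^ε / (1 + e^ε) and flipped otherwise, so no single report can be trusted, yet the aggregate rate can still be estimated.

```python
import math
import random


def randomized_response(true_value, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_value if rng.random() < p_truth else not true_value


def estimate_rate(reports, epsilon):
    """Debias noisy reports to estimate the true fraction of True values.

    Requires epsilon > 0; at epsilon == 0 the reports carry no signal.
    """
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # observed = rate * p + (1 - rate) * (1 - p), solved for rate.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


# Simulated population: 30% of subjects hold the sensitive attribute.
rng = random.Random(0)
truth = [True] * 3000 + [False] * 7000
reports = [randomized_response(t, epsilon=1.0, rng=rng) for t in truth]
estimated = estimate_rate(reports, epsilon=1.0)
```

Here ε is the dimmer: a smaller ε flips more answers and gives stronger privacy, while a larger ε keeps the estimate tighter.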
Databricks just announced Delta Sharing, an open protocol for secure data sharing. Delta Sharing allows you to share your data with third parties via scalable techniques, such as pre-signed URLs to shared data, eliminating the need to copy or move data. It is part of the widely adopted open source Delta Lake project, simplifying data sharing with third parties regardless of which computing platforms they use.
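On the recipient side, reading a shared table might look like the sketch below. It assumes the open source `delta-sharing` Python connector is installed (`pip install delta-sharing`) and that the data provider has sent you a profile file; the profile path and the share, schema, and table names are hypothetical placeholders. The connector addresses a table as the profile path, a `#`, and then `share.schema.table`.

```python
def table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' address for a shared table."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path, share, schema, table):
    """Load a shared table into a pandas DataFrame without copying it into your storage."""
    # Imported lazily so the URL helper above works without the connector installed.
    import delta_sharing

    return delta_sharing.load_as_pandas(table_url(profile_path, share, schema, table))


# Hypothetical names, for illustration only.
url = table_url("partner.share", "media_share", "audience", "viewership")
```

Calling `load_shared_table("partner.share", "media_share", "audience", "viewership")` would then fetch the data on your own compute, provided the provider has granted access to that table.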
Pulling these two capabilities together is powerful. We work with a massive media company that would spend six months scrubbing data and working through “approved queries” in a multi-party computation approach with a third party, just to make a data exchange such as this possible. Now, with Delta Sharing plus Immuta, the same exchange is possible immediately, with the added flexibility of ad hoc queries on the company’s own Databricks resources.