Lifting and shifting data science operations to the cloud is expensive and risky. Immuta 2.3 – available today – can help save you over 60% of cloud infrastructure costs while enforcing privacy controls across all of your enterprise data.
How do we do it?
As part of Immuta’s 2.3 release, we’ve expanded Database/Hadoop/Spark data privacy controls to support separation of compute and storage for batch workloads. This is important for a multitude of reasons, and covers several forward-leaning concepts that will save your organization time and money. Let’s quickly describe some of those concepts.
Separating Compute and Storage: This is a cloud concept (with some on-premises implementations) that allows data to sit at rest in storage (think Amazon Web Services S3), and when computations against that data are needed, only then are compute servers spun up, transiently, to process the externally stored data. This allows you to not require a Hadoop instance to exist in perpetuity and thus saves infrastructure costs.
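As a concrete sketch of that transient pattern on AWS (cluster name, release label, instance sizes, and the S3 job path below are all placeholder assumptions, not a recommendation): an EMR cluster can be created per batch job and told to terminate itself when its step completes, so no Hadoop cluster exists between runs.

```shell
# Spin up a transient EMR cluster that reads data from S3, runs one
# Spark step, and terminates itself when the step finishes.
aws emr create-cluster \
  --name "nightly-batch" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --steps Type=Spark,Name="batch-job",ActionOnFailure=TERMINATE_CLUSTER,Args=[s3://my-bucket/jobs/batch_job.py] \
  --auto-terminate
```

With `--auto-terminate`, you pay for compute only while the job runs; the data itself stays in S3 the whole time.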
Data Privacy Policies: Think Health Insurance Portability and Accountability Act (HIPAA), which can require masking of certain Personally Identifiable Information (PII), or the EU General Data Protection Regulation (GDPR), where rows of data may need to be redacted based on the user’s operating country, to name just two examples of the policies we support.
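To make those two policy types concrete, here is a minimal, hypothetical sketch (plain Python, not Immuta's engine): a HIPAA-style mask applied to an identifier column, and a GDPR-style row redaction based on the user's operating country. The field names and records are invented for illustration.

```python
import hashlib

def mask(value: str) -> str:
    """Pseudonymize a direct identifier with a one-way hash (illustrative)."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def apply_policies(rows, user_country):
    """Apply two illustrative policies: HIPAA-style masking of the 'ssn'
    column, and GDPR-style redaction of rows outside the user's country."""
    visible = []
    for row in rows:
        if row["country"] != user_country:
            continue  # row-level redaction
        safe = dict(row)
        safe["ssn"] = mask(row["ssn"])  # column-level masking
        visible.append(safe)
    return visible

records = [
    {"name": "Alice", "ssn": "123-45-6789", "country": "DE"},
    {"name": "Bob",   "ssn": "987-65-4321", "country": "US"},
]
visible = apply_policies(records, user_country="DE")
```

Note that the raw records are never modified; the policy is applied on the way out, per user.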
The most common approach to enforcing data privacy controls is to create pseudonymized/anonymized copies of data and then wrap security controls around those copies. But making copies of data for the purpose of data privacy controls is what we consider an anti-pattern (it’s a seemingly good approach that’s actually really bad) for a number of reasons.
- To start with, this approach is based on making copies, which costs storage dollars. Every unique combination of policy and user type requires a new copy of the data — this can get out of hand very quickly.
- Adding to that cost is the effort required to write and manage the code that creates the copies. That code can be difficult to change when policies change, and creates huge costs in complexity and employee work hours.
- The copies (static snapshots) will always lag behind the raw data, which limits analysis. Often, appending updates to a copy is not possible with the privacy technique being used, so the entire source must be re-pseudonymized/anonymized from scratch.
- Last, your compute workloads cannot be multi-tenant. This means you must create compute clusters per policy-enforcing copy of data to ensure the users don’t cross-contaminate data within the compute cluster (think AWS EMR instances).
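To put a rough number on the copy explosion described above (policy and user-type names below are invented for illustration): in the worst case you need one copy of the data, and one compute cluster, per unique policy/user-type pairing.

```python
# Hypothetical policies and user types; the names are illustrative only.
policies = ["mask_pii", "redact_by_country", "limit_by_purpose"]
user_types = ["internal_analyst", "external_partner", "auditor"]

# Anti-pattern worst case: one policy-enforcing copy, and one compute
# cluster, per unique policy/user-type pairing of a single dataset.
copies_needed = len(policies) * len(user_types)
clusters_needed = copies_needed
print(copies_needed, clusters_needed)  # 9 copies and 9 clusters for 1 dataset
```

Add a fourth policy or a fourth user type and you are immediately provisioning several more copies and clusters of the same data.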
In short, you’ve potentially removed any and all benefit you created by migrating to the cloud.
On AWS, this anti-pattern looks like:
Immuta solves this problem for you by enforcing data policies dynamically, just-in-time, as your analytical jobs run. In other words, you only need a single live copy of the raw data. And because Immuta applies the policies just-in-time, based on the user and the data, your compute clusters can be multi-tenant.
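A minimal sketch of that just-in-time pattern (plain Python, not Immuta's actual implementation — the policy registry and table are invented): policies are looked up per user and applied as the query runs, so every tenant reads the same single raw table on the same cluster.

```python
# The single live copy of the raw data.
RAW_TABLE = [
    {"name": "Alice", "country": "DE", "diagnosis": "J45"},
    {"name": "Bob",   "country": "US", "diagnosis": "E11"},
]

# Hypothetical per-user policy registry: each policy is a function that
# rewrites the row set at query time.
POLICIES = {
    "eu_analyst": [lambda rows: [r for r in rows if r["country"] == "DE"]],
    "us_analyst": [lambda rows: [r for r in rows if r["country"] == "US"]],
}

def query(user):
    """Resolve and apply the user's policies at query time. The raw
    table is never copied, so many tenants can share one cluster."""
    rows = RAW_TABLE
    for policy in POLICIES.get(user, []):
        rows = policy(rows)
    return rows
```

Two users with different entitlements run against the same table and see different results, with zero extra copies.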
On AWS this looks like:
The benefits to this approach become abundantly clear when operating in the cloud where the point is to manage workloads transiently in order to save cost.
A Look at the Numbers
We’ll start with our variable, X: the number of data policy combinations possible across the intersection of users and policies that need to be enforced. Immuta takes the complexity out of understanding these combinations by making them easy to manage through our policy authoring and rules engine.
Once you have an estimate of X, you can work backwards to your actual infrastructure savings:
E = EMR instance costs
S = S3 Storage costs
The formula: (E + S) – ((E + S) / X), where (E + S) is what you pay for X policy-enforcing copies and their clusters, and (E + S) / X is what you pay for a single live copy.
Starting with only three policy combinations, you are already saving 66% on your compute and storage costs, and that savings grows as the number of required policy combinations increases:
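The arithmetic behind those numbers, as a quick sanity check: the fraction of (E + S) saved is ((E + S) − (E + S) / X) / (E + S), which simplifies to 1 − 1/X.

```python
def savings_percent(x: int) -> int:
    """Percent of (E + S) saved: ((E+S) - (E+S)/x) / (E+S) = 1 - 1/x."""
    return int((1 - 1 / x) * 100)

for x in (3, 5, 10):
    print(f"{x} policy combinations -> {savings_percent(x)}% savings")
# 3 -> 66%, 5 -> 80%, 10 -> 90%
```

At X = 3 you already save 66%, and the curve keeps climbing toward 100% as X grows.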
With Immuta enforcing the data policy controls dynamically, your storage cost remains static and your compute is tied entirely to utilization rather than to policies – no matter how many policy combinations. You will need more compute resources if your user count grows (regardless of multi-tenancy); however, transient EMR clusters cost a fraction of your storage costs.
The real magic here is that you’ve removed cost and effort as the primary decision factors for how you enforce your data privacy controls. In other words, organizations have historically defaulted to “no, you can’t touch that data,” or “you can only see this highly anonymized version,” because of the issues associated with the anti-pattern above.
Download the Cost Savings Solution Brief here.