Article

Security and Privacy Layers in Snowflake for Highly Sensitive Data: Part 2

In part one of this two-part series, we explored the basics of encryption at rest and RBAC controls. In this next installment, we’ll examine how to take this a step further and protect your highly sensitive data from your DBAs and cloud providers, while at the same time enforcing fine-grained privacy controls.

This is where things start to get tricky and we get back to our different categories of privacy described in part one. A common idea for maximum protection is: let’s encrypt all the data in the database like we did in storage and I’ll control the key! Encrypting (or tokenizing) the data before it lands in the database can be commonly termed “encrypted on-ingest.” For example, you encrypted that data you are indexing in the Snowflake database before it ever landed in storage (and yes, that means it’s encrypted twice in storage so when it’s decrypted out of storage through that interface discussed in part one, it’s still encrypted because you did it on ingest). If you take this approach, now only the person managing that encryption needs to be trusted. Yay!

immuta-provides-privacy-security-de-risk-cloud-analytics-diagram-4

immuta-provides-privacy-security-de-risk-cloud-analytics-table-5

Sorry, this is a terrible idea.

Why? You just made your database completely worthless because everything is indexed by the encrypted value, as you can see in the above table.  Looking at an even simpler example, you just turned this row in your database:

name: Steve
credit_card_number: 123456789
age: 44
address: 8787 Diamondback Drive, College Park MD

into this:

name: 3098jgp9iwr9iewgp9rweifg9e
credit_card_number: 09q3jepf3q0e8fefg8
age: 09q34ejtgpe9rgpiofjopssd
address: 09q34peigpe9irfdinsifvods

So now, when you run a query, your database doesn’t have that same fancy interface you had between your storage layer, and you won’t get any results back.

SELECT * FROM table_x WHERE name = ‘Steve’

..nothing…

SELECT * FROM table_x WHERE age > 40

…breaks – you just tried to query a text column with a numeric operation…the column had to get converted to text to encrypt it.

But what if you tokenize the data?

It’s the same problem. You are just replacing the garbled encrypted values with just as useless  (albeit more visually pleasing) fake values that also can’t be queried in a meaningful way. The only thing that solves is the column type problem.

Ok, so how about you add a fancy interface that knows how to decrypt on the fly like we have with the storage layer!? Yes, you can absolutely do that, but it only solves a small piece of the problem. This is because the read from storage is “dumb” – it just needs to read blobs of data and have the interface decrypt them. The interface between the database and data must be “smart.” It needs to only pull relevant data from its index based on a query. So the problem is, all the indexes are still based on the encrypted values. To make this work:

SELECT * FROM table_x WHERE name = ‘Steve’

the interface must convert the ‘Steve’ into the encrypted value and push that down to the database so it can find the encrypted version of Steve, like this:

SELECT * FROM table_x WHERE name = ‘3098jgp9iwr9iewgp9rweifg9e’

Voila! You get results, and they can be decrypted on read:

name: Steve
credit_card_number: 123456789
age: 44
address: 8787 Diamondback Drive, College Park MD

The problem is, this really only works well for equality (“=”) queries. … So what do you do in this scenario?

SELECT * FROM table_x WHERE age > 40

The encryption does not take order into account, so this isn’t going to work at all. For example, 40 could get encrypted to xyz and 41 could get encrypted to abc. Tokenization will have the same issue. There are order preserving encryption algorithms and other encryption algorithms that can support fuzzy search, but they are less secure encryption algorithms. If you want to learn more about them, check out this paper from MIT CSAIL.

So where does this leave you? You should encrypt on ingest sparingly or not at all. If you choose to do it sparingly, this is where the privacy categories come into play. Encryption on ingest should be done against your most risky data, typically direct identifiers (essentially anything that must be hidden at all cost from your DBA and cloud provider). This also happens to be the category of data with the least utility, i.e., people won’t be querying it often (and if they are, it will be equality queries) or using it for their analysis. Real analysis is reserved for the juicy categories, like indirect identifiers and sensitive data. Here’s the latest picture:

immuta-provides-privacy-security-de-risk-cloud-analytics-diagram-5

immuta-provides-privacy-security-de-risk-cloud-analytics-table-6

As you can see here, everyone can see some of the data – the indirect identifiers and sensitive data – and we still have table controls the DBA put in place. But only the Key Manager can see all of the data – in this case, everything plus the SSN column. This is good because neither your DBA nor your Cloud Database Provider can see the encrypted, highly sensitive direct identifier (SSN) now. Also, if a data breach occurs, the attacker will only see the encrypted direct identifiers. This is still not perfect because someone could breach your key management system and see the data. It all depends on where the breach occurs in your trust layers because all we did was push the trust somewhere else.

At this point, you still have indirect identifiers that are completely visible to users. Now you’ve come full circle to privacy controls. Many times, a solution to privacy controls is to create multiple versions of the same table where different columns are missing so that you can share those through your table controls as needed. But this isn’t good enough. If you think back to the Netflix example in part one, the movie ratings were both the attack failure point and the primary feature being used for the analysis. This cannot be a yes/no, binary control.

This is where anonymization techniques can be applied; these are algorithms you can use to “fuzz” the data in such a way that it retains some level of utility but also provides a level of privacy, commonly termed Privacy Enhancing Technologies (PETs). One such PET is k-anonymization, which ensures there is no indirect identifier value revealed to the analyst/user that is unique enough to open the direct identifier(s) to a link attack. PETs can be applied through data copies as well, just like we discussed with hiding columns. But why bother doing that – it will result in static copies of data? You can use your fancy new trick, where instead of decrypting-on-the-fly, you fuzz-on-the-fly.

This is powerful because it allows you to query the underlying data in meaningful ways (remember you can’t do that when it’s simply encrypted) and also utilize the results in meaningful ways for your analysis. You are getting value from the data, while at the same time maintaining privacy. These techniques can be applied at query time to both indirect identifiers and sensitive data. Why do it at query time? As the amount of different privacy views into your data increases, it becomes an extract-transform-load (ETL) and role management nightmare to do this with table/data copies. It must be dynamic through the interface. Here’s that picture:

immuta-provides-privacy-security-de-risk-cloud-analytics-diagram-6

immuta-provides-privacy-security-de-risk-cloud-analytics-table-7

Look closely at the gender and race columns. We’ve applied the k-anonymization policy to them, which suppressed highly unique values (either in combination or by themselves) that could lead to a privacy breach with NULLs. But we left the other values in the clear, providing a high level of utility from these columns rather than completely removing them. Immuta offers a wide variety of PETs like k-anonymization which allow customers to make tradeoffs between privacy and utility – fuzz instead of block.

Additionally, Immuta provides an attribute-based access control (ABAC) model rather than a role-based access control model (RBAC). The key difference with ABAC is that complex and completely separate rules can evaluate many different user attributes rather than conflating WHO and WHAT they have access to in a role.

As described by the NIST Guide to Attribute Based Access Control (ABAC) Definition and Considerations: at access-request-time, “the ABAC engine can make an access control decision based on the assigned attributes of the requester, the assigned attributes of the object, environment conditions, and a set of policies that are specified in terms of those attributes and conditions. Under this arrangement policies can be created and managed without direct reference to potentially numerous users and objects, and users and objects can be provisioned without reference to policy.“  

ABAC allows for flexibility and scalability when building and enforcing policies, helping you avoid the “role explosion” required to cover all column access and anonymization scenarios when using RBAC. Immuta enables ABAC-based dynamic enforcement natively in Snowflake, without the need for managing Snowflake roles for your data analysts.  

This delivers the full solution. You get all the pieces with the right type of controls: 

  • Security is maintained by managing your own encryption keys for the data encrypted-at-rest (storage) as well as for the highly sensitive encrypted-on-ingest direct identifiers in your database.
  • Privacy is maintained through the dynamic fuzzing-on-the-fly, commonly termed dynamic masking that occurs as part of the query interface to provide a level of utility with privacy – using the ABAC model.  That same interface can manage the encryption/decryption operations on the direct identifiers.

And yes, it is possible to not encrypt anything on-ingest and instead manage all controls in the dynamic masking interface. To do this, there are algorithms to completely obfuscate values rather than fuzz them. This provides much more flexibility and functionality. If you take this approach, remember that some extra trust then lies with the cloud provider and your DBA for your direct identifiers rather than the Key Manager. For example, we’ve redacted last_name on the fly in our table:

immuta-provides-privacy-security-de-risk-cloud-analytics-table-8

You are now at the intersection of security and privacy and meeting the demand of both your data analysts and your legal and compliance teams.

But how do you actually implement it?

This is where Immuta comes in. Immuta can act as that interface between your queries and the database to execute dynamic masking and decryption. Below is a diagram of how it works:

immuta-provides-privacy-security-de-risk-cloud-analytics-architecture

The typical analytical user on the right is querying the data as usual from Snowflake and the Immuta interface is dynamically masking the data (privacy controls) based on policy – natively in Snowflake (no proxy). In other words, this encrypt-on-the-fly and fuzz-on-the-fly approach has Immuta leveraging built-in Snowflake features to ensure maximum flexibility and compatibility. The encrypted SSN can never be decrypted natively in Snowflake.

However, should a user with special access need to see the SSN column decrypted, they can leave native Snowflake and query through the Immuta proxy, which lives on-premises. In this case, that interface (the proxy) is able to decrypt the data using a customer-provided encryption/decryption algorithm service. This can talk to their own key management service – the details of which are completely abstracted from Immuta. Note this on-premises proxy can also achieve the dynamic masking and can support any other database – on-premises or in the cloud.

You can watch a video of this in action here:

 

 

Want to learn more? Request a demo today.

Request a Demo