“Over the past few years, data privacy has evolved from ‘nice to have’ to a business imperative and critical boardroom issue,” warns a recent Cisco report. Security and privacy are colliding in the cloud and creating challenges never before seen. This is particularly true for organizations moving from on-premises data warehouses to SaaS-based data platforms like Snowflake, where parties outside the organization (the cloud provider and the SaaS provider) need to be trusted at some level. To tackle this problem, you need to think about layers of trust, where you want that trust to live, and the data utility tradeoffs that come with each choice. Let’s explore these different levels of trust and what protections can be deployed to make them practical.
To set the stage, let’s first focus on privacy. For privacy purposes, data falls into four categories:
- Direct identifiers: credit card number, name, driver’s license number
- Indirect identifiers: a sparsely populated zip code, a rare car make/model, a rare disease you have
- Sensitive information: your sexual orientation, a disease you have
- Other data
Sensitive information can overlap with indirect identifiers (e.g., a disease you have). This makes indirect identifiers interesting: they can straddle the line between sensitive information and direct identifiers. How can an indirect identifier act as a direct identifier? Consider the Netflix Challenge privacy scandal. When Netflix released its data for the challenge, the data contained no direct identifiers indicating who rated what movie. It included the ratings themselves along with other information to help build movie predictions, which one would think is simply “other data.” But by comparing the released ratings with ratings posted on other sites, attackers were able to identify the movie raters, ultimately leading to a lawsuit (this is termed a linkage attack). The ratings themselves, presumably the most important part for the movie prediction algorithms, were in fact also indirect identifiers that broke privacy!
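To make the linkage attack concrete, here is a minimal Python sketch. All of the data, pseudonyms, names, and the `min_overlap` threshold are made up for illustration; this is not the method used against the actual Netflix data, only the general idea of joining a "de-identified" release to a public source on the ratings themselves.

```python
from collections import defaultdict

# Hypothetical "de-identified" release: pseudonym, movie, rating (no names).
released = [
    ("user_17", "Movie A", 5), ("user_17", "Movie B", 1), ("user_17", "Movie C", 4),
    ("user_42", "Movie A", 3), ("user_42", "Movie D", 2),
]

# Hypothetical public reviews on another site: real name, movie, rating.
public = [
    ("alice", "Movie A", 5), ("alice", "Movie B", 1),
    ("bob", "Movie D", 4),
]


def profiles(rows):
    """Group rows into {who: {(movie, rating), ...}}."""
    out = defaultdict(set)
    for who, movie, rating in rows:
        out[who].add((movie, rating))
    return out


def link(released_rows, public_rows, min_overlap=2):
    """Map a pseudonym to a name when exactly one public profile shares
    at least `min_overlap` identical (movie, rating) pairs with it."""
    anon = profiles(released_rows)
    known = profiles(public_rows)
    matches = {}
    for pseudonym, anon_ratings in anon.items():
        candidates = [
            name for name, ratings in known.items()
            if len(ratings & anon_ratings) >= min_overlap
        ]
        if len(candidates) == 1:  # an unambiguous match re-identifies the user
            matches[pseudonym] = candidates[0]
    return matches


matches = link(released, public)
```

In this toy data, `user_17` shares two identical ratings with `alice` and is re-identified, which also exposes the third rating that `alice` never posted publicly. No direct identifier was needed.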
To effectively deal with privacy, you not only have to manage direct identifiers but, depending on your legal and ethical obligations, you also need to protect indirect identifiers and sensitive information from your own employees. This is the key because when you talk about indirect identifiers and sensitive information, that’s the actual data you want/need to analyze. As you can see from the Netflix Challenge, the movie ratings can’t just be “turned off.” So what do you do? We’ll come back to that.
Security is a different but related beast. Security breaches are most often discussed in terms of direct identifiers. If someone breaks into your database account, for example, they can immediately read your data and see everyone’s credit card number. Or worse, the attacker can be your own disgruntled database administrator (DBA). When you think about it, all an attacker is trying to do is gain illegitimate access to the rights your employees legitimately have, especially administrator-level access. If someone has unfettered access to your data, they can see all of it, no matter the privacy category.
Let’s consider the following diagram and table, and imagine there are no controls at all:
In this scenario, all three personas can run queries against any table. Additionally, they could bypass the database altogether and read directly from storage. Note that we are oversimplifying “read” from storage in this example. This does not mean “read” through a service, but rather literally reading the data directly from the hard drive.
This architecture should be undesirable for everyone. Let’s move to the next level.
You probably don’t want to trust your DBA or the user to read directly from storage. That’s the purpose of the database and the database controls (which we’ll get to in a bit). This is why both your cloud service provider and Snowflake use encryption at rest.
This gives you two layers of transparent encryption: one at the cloud service layer and another at the Snowflake Data Platform level, where every file holding your information is uniquely encrypted. But how do the database queries work, then? Pretty much the way your laptop works: in the read interface between your queries and the data, the data is all “in the clear” as far as the query is concerned. When you encrypt the storage on your laptop, you don’t notice that every file you open is being decrypted for you in that read interface, but it is. That’s why you still see everything in the table even though it’s encrypted at rest. So how is this protecting you? If you configure it all well, it means the people at the hardware level can’t see the data (just like someone who steals your encrypted laptop hard drive but doesn’t have your password). Since Snowflake is encrypting all the files too, you’re also protected from the cloud service provider staff as a whole with “customer managed keys” technology.
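The "transparent" part can be sketched in a few lines of Python. This is a toy model under loud assumptions: `TransparentStore` and `_keystream` are hypothetical names, and the SHA-256-based XOR keystream is a stand-in for illustration only, not real cryptography and not how Snowflake or any cloud provider implements encryption at rest.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream built from SHA-256 and a counter. NOT production crypto."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


class TransparentStore:
    """Hypothetical storage layer that encrypts on write and decrypts on read."""

    def __init__(self, key: bytes):
        self._key = key
        self.disk = {}  # file name -> (nonce, ciphertext): what sits on the drive

    def write(self, name: str, plaintext: bytes) -> None:
        nonce = secrets.token_bytes(16)  # fresh per file, so each file is encrypted uniquely
        stream = _keystream(self._key, nonce, len(plaintext))
        self.disk[name] = (nonce, bytes(a ^ b for a, b in zip(plaintext, stream)))

    def read(self, name: str) -> bytes:
        # The read interface decrypts silently; callers never handle the key.
        nonce, ciphertext = self.disk[name]
        stream = _keystream(self._key, nonce, len(ciphertext))
        return bytes(a ^ b for a, b in zip(ciphertext, stream))


store = TransparentStore(key=secrets.token_bytes(32))
store.write("customers", b"alice,4111-1111-1111-1111")

query_view = store.read("customers")    # what a query sees: plaintext
_, raw_bytes = store.disk["customers"]  # what someone reading the raw drive sees
```

The query path gets the record back in the clear, while anyone who reads `raw_bytes` straight off the drive without the key sees only ciphertext. The sketch also shows the limit of this layer: whoever is allowed through `read` sees everything.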
That’s as far as you can go at the lower layers, because going further just shifts trust somewhere else. This leads us to the fun part. You don’t want everyone to be able to query every table. To ensure this doesn’t happen, you now have your DBA place table-level controls restricting who can query which tables, which results in this picture:
Now your users can’t go around querying any table they want. The table above is still in the clear, but at least only to certain people. In Snowflake, this uses the same familiar role-based access control (RBAC) that almost every data platform has used, and it is enforced in SQL. For example, your DBA can still see the table, as can the users who are entitled to see it.
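As a rough illustration of table-level RBAC, the following Python sketch models roles, grants, and the access check. `AccessCatalog`, the role names, and the user names are all hypothetical; in Snowflake the equivalent controls are expressed as SQL grants rather than through any API like this one.

```python
from collections import defaultdict


class AccessCatalog:
    """Toy model of table-level RBAC; names are hypothetical, not a Snowflake API."""

    def __init__(self):
        self.role_tables = defaultdict(set)  # role -> tables it may SELECT from
        self.user_roles = defaultdict(set)   # user -> roles granted to that user

    def grant_select(self, table, role):
        # Comparable in spirit to: GRANT SELECT ON TABLE <table> TO ROLE <role>
        self.role_tables[role].add(table)

    def grant_role(self, role, user):
        # Comparable in spirit to: GRANT ROLE <role> TO USER <user>
        self.user_roles[user].add(role)

    def can_select(self, user, table):
        # A user may query a table if any of their roles holds SELECT on it.
        roles = self.user_roles.get(user, set())
        return any(table in self.role_tables.get(role, set()) for role in roles)


catalog = AccessCatalog()
catalog.grant_select("customers", "dba")
catalog.grant_select("orders", "dba")
catalog.grant_select("orders", "analyst")
catalog.grant_role("dba", "dana")
catalog.grant_role("analyst", "ana")

ana_orders = catalog.can_select("ana", "orders")         # entitled user
ana_customers = catalog.can_select("ana", "customers")   # not entitled
dana_customers = catalog.can_select("dana", "customers") # the DBA still sees it
```

Note what the model makes obvious: access is all or nothing per table. Anyone holding a role with a grant, including the DBA, reads every row and column in the clear.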
At this point, we’ve set the foundation for very basic security but with several remaining holes. In our next installment, we’ll explore how you can create the appropriate mix of security and privacy.