Why Differential Privacy Changes the Game – Exploring Immuta’s New Feature
We’re very excited to announce that differential privacy is now a data anonymization policy available in Immuta’s 2.0 release.
Why Is Differential Privacy A Game-Changer?
Data anonymization techniques have been around for a long time. The most common technique is to mask values in the data, hiding their true meaning while still preserving the data's utility. For example, one could chop the last digit from a zip code to remove precision and thus protect the exact locations of data subjects. This kind of technique worked well last century, but it doesn't work anymore, and that's where differential privacy comes in. Differential privacy is the first and only technique that provides mathematical guarantees that no individual record within a dataset can be identified. And now it's in our product, and it couldn't be easier to use.
Before I go on, though, it's worth addressing the failures of traditional, non-differentially-private anonymization methods, just to make the importance of differential privacy even clearer.
As a quick example, if I asked you in 1985 what the run time of the Terminator movie was, what would you do? You'd probably have to physically go to a video store and read the box of the VHS tape (be kind, rewind!). If I asked you that question in 2017, you could answer it in a few seconds using an online search engine. Now let's flip the script: if I told you to figure out what popular movie has a run time of 1 hour and 47 minutes and was made in 1984, again, you could figure this out pretty easily in 2017…but in 1985 you'd really have a hard time. This illustrates the problem with trying to protect the movie title by masking the title, actors, and description while leaving the run time and year, which is exactly how traditional anonymization techniques function. The run time and year are plenty of data for us to crack the anonymized movie title today, but not in 1985. This is why statistical bureaus could use this technique last century, but when Netflix releases data today, they get mud in their face. This is called a link attack, and as more and more data land at everyone's fingertips, link attacks become increasingly easy to pull off.
Which brings us to differential privacy. In 2006, researcher Cynthia Dwork and her colleagues proposed a new technique that can actually provide privacy guarantees, which she termed “differential privacy.” Here's how Dwork described it: “In many cases, extremely accurate information about the database can be provided while simultaneously ensuring very high levels of privacy.”
Now we're getting somewhere. If you can provide statistical guarantees of privacy, sharing data becomes a lot simpler. And if you can share data, you can reap the benefits in several different ways: secondary use cases for existing data (not originally collected for that purpose), selling your data, collaboration with skilled external engineers and scientists, data exchanges that make you and your collaborators more powerful together, or even philanthropic use cases, all while protecting the privacy of your data subjects.
Making the Mathematics Behind Differential Privacy Simple
Typically, when you read about differential privacy, you get a math equation thrown at you alongside unintelligible explanations. I'm going to try to simplify this for you because, in fact, it's pretty simple once you leave the fine details out.
Let's jump right into an example where we have sensitive information to protect. Imagine we're trying to find out the proportion of people who hide purchasing information from their spouses. We've gathered 100 people in a room, and I first ask each of them to pick a number between 1 and 10, but not to say their answer out loud. Next, I ask them to raise their hand if they hide purchases from their spouse or picked a 3 in the previous step. Because we've combined the two questions, we've injected noise into the response, providing plausible deniability to everyone who raises a hand. Based on the number of hands raised, and knowing the probability of randomly choosing a 3, we can calculate the true proportion of people who hide purchasing information from their spouses while still protecting their privacy. In essence, we've added noise to the response to protect the privacy of the individuals.
Now, what if I asked the same question, but narrowed it: raise your hand if you hide purchases and are wearing a pink shirt. In this case, people may be a bit more apprehensive because the question got a little too sensitive (there aren't that many pink shirts in the crowd). We probably need more noise, and we could add it by having everyone pick a number between 1 and 3 instead of 1 and 10.
This technique of adding noise at data collection time is known as randomized response. Google recently published a paper describing how it uses this concept to collect anonymous usage statistics in its Chrome browser.
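The show-of-hands experiment above can be simulated in a few lines of Python. This is just an illustrative sketch: the function names and the 100-person room scaled up to a large sample are my own choices, made so the estimate is stable.

```python
import random

def run_survey(n, true_rate, noise_prob, seed=0):
    """Simulate the show-of-hands survey: each person raises a hand if
    they truly hide purchases OR they picked the designated number."""
    rng = random.Random(seed)
    raised = 0
    for _ in range(n):
        hides = rng.random() < true_rate
        picked_number = rng.random() < noise_prob  # 1/10 for "pick 1-10"
        if hides or picked_number:
            raised += 1
    return raised

def estimate_true_rate(raised, n, noise_prob):
    """Invert the noise: observed f = p + (1 - p) * q, so p = (f - q) / (1 - q)."""
    f = raised / n
    return (f - noise_prob) / (1 - noise_prob)
```

With a noise probability of 1/10, a room where 30% truly hide purchases will see roughly 37% of hands go up, and inverting the noise recovers the 30% figure. Raising the noise probability to 1/3 (picking between 1 and 3) buys more deniability at the cost of a noisier estimate.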
Adding noise based on the sensitivity of a question is the heart of differential privacy. However, since our customers haven't injected noise into their data collection processes, Immuta adds noise to the query response dynamically at query time. In fact, we can add noise in such a way as to statistically guarantee the privacy of individual records, on the fly, just like all the other policies in Immuta.
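As a sketch of what "adding noise at query time" means mathematically, here is the classic Laplace mechanism applied to a count. This is the standard textbook construction, not a description of Immuta's internals; epsilon is the privacy parameter, where smaller values mean more privacy and more noise.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two exponential draws is Laplace-distributed
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(true_count, epsilon, rng=None):
    """Answer a COUNT query with Laplace noise calibrated to epsilon.
    A count has sensitivity 1: one record changes it by at most 1."""
    rng = rng or random.Random()
    sensitivity = 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

Each call returns the true count plus zero-mean noise, so any single answer protects individuals while averages over many independent datasets would still be accurate.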
This Sounds Easy, So Why Did It Take 11 Years To Get Into Commercial Software?
Immuta is the first company to dynamically enforce differential privacy on data without the use of a custom database or custom query language. But if it sounds so straightforward, why isn’t everyone else doing it? A few major challenges have stood in the way:
Challenge 1: Aggregate Questions Only. To make differential privacy work, you have to restrict questions to aggregates only. What does this mean? You can only ask questions that respond with a number, because the noise has to be added to that number. That means you can only ask questions like average, sum, count (as we did in the example above), median, etc. In other words, you can't ask for literal rows in the data, only aggregate questions of the data. Since Immuta acts as a virtual control plane between the data analysts/scientists and the databases, we can enforce restrictions on the types of SQL questions you can ask. In other words, our virtual control plane provides the perfect injection point for these restrictions on SQL statements.
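A minimal illustration of that kind of gate, purely hypothetical: a production control plane would use a real SQL parser, not a regex, and the allowed list here simply mirrors the aggregates named above.

```python
import re

# Aggregates that return a single number and can therefore be noised
ALLOWED_AGGREGATES = {"avg", "sum", "count", "median"}

def is_allowed_query(sql):
    """Allow only SELECTs whose first projection is an aggregate call."""
    match = re.match(r"\s*select\s+(\w+)\s*\(", sql, re.IGNORECASE)
    return bool(match) and match.group(1).lower() in ALLOWED_AGGREGATES
```

A `SELECT COUNT(*)` passes the gate; a `SELECT name` asking for literal rows does not.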
Challenge 2: Determining Sensitivity. I don't want to get bogged down in the details here, but sometimes it's really hard to tell how sensitive a question is based on a database query. Remember the pink shirt example: we'd have to know there aren't that many pink shirts and assign noise accordingly. In fact, we'd have to account for the probability of any given number of pink shirts randomly showing up in our test group. Again, Immuta's control plane lends itself well to this, as we have an internal technique for dynamically and quickly determining the sensitivity of a question relative to the available data, made possible by intercepting and managing the query.
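To make the pink-shirt intuition concrete: for a bounded mean, the most one record can shift the answer shrinks as the group grows, so rare subgroups demand more noise. The bounds and group sizes below are illustrative numbers of my own, not how Immuta computes sensitivity.

```python
def mean_sensitivity(value_bound, group_size):
    """For values bounded in [0, value_bound], changing one record
    shifts a mean over group_size records by at most bound / n."""
    return value_bound / group_size

# Laplace noise scale is sensitivity / epsilon
epsilon = 0.5
whole_room = mean_sensitivity(10_000, 100) / epsilon  # all 100 people
pink_shirts = mean_sensitivity(10_000, 4) / epsilon   # only 4 pink shirts
```

With these numbers, the four pink-shirt wearers force a noise scale 25 times larger than the whole room, which is exactly why the more pointed question needs more noise.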
Challenge 3: Privacy Budget. This one is a bit confusing, but again, it can be explained pretty easily with an example. Let's go back to the same example we used above, but repeat the experiment several times, having everyone choose new random numbers and adjusting our sensitivity accordingly. Given enough samples of anything, the average of those samples converges to the true value. This is a problem because the noise is drawn from a random process and added to the response, so repeating the question would eventually average the noise away. The classic way to prevent this is to limit the number of questions people can ask, which is what's known in differential privacy as “the privacy budget.” However, this is a poor solution because it typically leaves a very small budget of questions, which in turn doesn't allow data exploration. This is probably the number one reason differential privacy hasn't gone mainstream. Again, Immuta is able to avoid this problem because we sit in a privileged location between the data and the data consumer. We can capture the fact that the question has already been answered, and instead of asking the room to think of a new random number, we give the same noisy response we previously calculated. In fact, you can tell Immuta how often to refresh the noise in our responses based on how often your data is changing under the covers (let's face it, most commercial data isn't static).
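That "give the same noisy response again" idea can be sketched as a simple cache keyed on the query. This is a toy illustration, not Immuta's implementation; `refresh()` stands in for the configurable refresh interval.

```python
import random

class NoisyAnswerCache:
    """Repeated queries get the previously drawn noisy answer,
    so averaging repeated asks can't wash the noise out."""

    def __init__(self, epsilon, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.answers = {}

    def query(self, sql, true_value, sensitivity=1):
        if sql not in self.answers:
            scale = sensitivity / self.epsilon
            # Difference of two exponential draws is Laplace noise
            noise = self.rng.expovariate(1 / scale) - self.rng.expovariate(1 / scale)
            self.answers[sql] = true_value + noise
        return self.answers[sql]

    def refresh(self):
        """Drop cached answers, e.g. when the underlying data changes."""
        self.answers.clear()
```

Asking the same question twice returns the identical noisy value, so the asker learns nothing new; only a data refresh draws fresh noise.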
Challenge 4: Too Noisy. Sometimes folks can't help themselves and ask very sensitive questions, which means they get back really noisy responses. This can be bad for analysis. Using our sensitivity technique, Immuta can understand the sensitivity of a question, and instead of adding a huge amount of noise to a very sensitive question, we simply block the query: “ask something less sensitive, please.” This helps data scientists learn to use differential privacy and protects them from relying on responses that are wildly inaccurate.
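The block-instead-of-drown behavior amounts to a threshold check on the noise scale before answering. The `max_scale` knob below is invented for illustration; the principle is simply that past some point, noise swamps signal.

```python
def should_answer(sensitivity, epsilon, max_scale=10.0):
    """Refuse the query outright when the Laplace noise needed
    to protect it would swamp any real signal in the answer."""
    return sensitivity / epsilon <= max_scale

def answer_query(true_value, noise_fn, sensitivity, epsilon):
    if not should_answer(sensitivity, epsilon):
        raise PermissionError("ask something less sensitive, please")
    return true_value + noise_fn(sensitivity / epsilon)
```

A low-sensitivity count sails through; a query whose required noise scale exceeds the threshold is rejected before any answer is computed.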
How Would Immuta’s Differential Privacy Work In My Environment?
The nice thing is that our differential privacy technique works like all the other policies in Immuta: you add it easily through our policy builder on data exposed from any database in your organization. The noise will only be added for users who don't meet the conditions of the policy! So let's see it in action: