I’ve seen it written many times that “Privacy can be improved with techniques like differential privacy and homomorphic encryption.” Sure, but for totally different reasons.
Homomorphic encryption is an added layer of security that allows computation against encrypted data without decrypting it. Without homomorphic encryption, databases must first decrypt the data before they can answer your question, which leaves a small opening for an adversary listening on your database to intercept information. This is what I’d term an enhanced security technique – not a privacy technique alone. If that sounds like splitting hairs, it really isn’t. Security and privacy are different yet related concepts. In order to understand where homomorphic encryption fits, we need to begin by understanding where security and privacy overlap – and where they don’t.
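To make "computation against encrypted data" concrete, here is a minimal sketch of an additively homomorphic scheme (Paillier) in Python. It is a toy under loud assumptions: the primes are tiny, the function names are mine, and it only supports addition, so it is partially rather than fully homomorphic. Nothing here should be used to protect real data.

```python
import math
import secrets

def keygen(p=1789, q=1861):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; this shortcut assumes g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under the public modulus n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random blinding factor
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the hidden plaintexts."""
    return (c1 * c2) % (n * n)

n, private_key = keygen()
c_total = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
total = decrypt(n, private_key, c_total)
print(total)  # 42
```

The party calling add_encrypted never holds the private key and never sees 20 or 22. Note that this is still all-or-nothing: whoever holds the key sees everything, and everyone else sees nothing.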
Security is about hackers. Those hackers can be inside your company walls, or, more often than not, outside your walls. Those hackers want to breach your data, and security is your countermeasure. A common security method is authentication, managed through passwords or two-factor authentication. This is how your email client ensures it’s you logging in and not someone else. Network firewalls are another critical way to protect against hackers – think of this as a way to prevent external adversaries from getting to the data you’re trying to protect because they can’t even get on your network in the first place. Lastly, a slightly less obvious method is encryption – this ensures that if someone were to physically steal your server or the data on it, or if they were intercepting network traffic, the data they would get would be encrypted (think of this as scrambled in a way that only people with the right key can descramble) and of no use to them. As mentioned, homomorphic encryption is an advanced method of encryption.
Authentication, firewalls, and encryption are black and white: you are either allowed in, or you aren’t. You can log in to your email, but your colleague can’t; you can get through your network firewall, but someone outside of your company can’t; you can access data through official channels which will decrypt it for you, but an adversary intercepting data through unofficial channels will get garbage. Dr. Seuss might say, “The allowed are in, the unallowed are out.”
Privacy is not black and white. In fact, that is exactly why privacy is so challenging. You need to find middle ground: enable analysis, while preserving some level of privacy. This is commonly referred to as the “privacy vs. utility trade-off.” Privacy techniques enforced within companies provide assurances about how your data is used or misused, without relying on blind trust in those companies or their employees. This sounds easy on the surface, but let’s talk about some major challenges here.
Consider the keyboard on your smartphone. Some fancy keyboard apps will predict what you’re going to type based on your personal tendencies to help you type faster. To do this, they capture what you’re typing and details about you, send it back to their servers for analysis, and ship back an updated algorithm to your phone to make predictions. Why not just have the algorithm learn on your phone? Well, servicing a user that just installed the app, a “ground zero” user, would result in predictions that are not yet personalized for that user. A generalized global model is required to solve this scenario. To do so, the global algorithm can leverage your personal information to categorize you and make predictions – “you are at IP address 22.214.171.124, with an Android phone, I’m going to make predictions based on similar users.” As you can see, employees at the mythical “Super Keyboard” can access anything you’ve ever typed and much of your related personal information. This is an example of getting unbounded utility from the data at the expense of privacy – something the GDPR and other regulations want changed.
Privacy measures can make this less invasive, such as masking your name and/or phone ID. Or, your age could be bucketed into ranges rather than stored as your precise age. Bucketing like this is called generalization, and it is one of the main tools for achieving k-anonymization, where every record is indistinguishable from at least k-1 others on its identifying attributes. Privacy techniques such as masking, generalization, and k-anonymization are commonly referred to as pseudonymization techniques because they help, but aren’t true anonymization (hence “pseudo-anonymization”). Pseudonymization means privacy is preserved, but there’s no way to quantify how well preserved. There are also more complex and protective techniques like differential privacy. Differential privacy, unlike pseudonymization techniques, can actually provide true anonymization and lets you quantify how much privacy is being preserved. There are other even simpler controls that can be implemented, such as data retention policies – “delete user keyboard data immediately after use” – but these may not be possible in many cases because that data is required for other purposes.
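Masking and bucketing are simple enough to sketch in a few lines of Python. The records, field names, and helper functions below are all hypothetical, made up for illustration; the point is only to show how bucketing ages raises the size of the smallest group of look-alike users, which is the "k" in k-anonymization.

```python
from collections import Counter

def bucket_age(age, width=10):
    """Generalize a precise age into a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def smallest_group(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())

# Hypothetical keyboard-app users
users = [
    {"name": "Ann", "age": 31, "os": "Android"},
    {"name": "Bob", "age": 36, "os": "Android"},
    {"name": "Cy", "age": 38, "os": "Android"},
    {"name": "Dee", "age": 52, "os": "iOS"},
    {"name": "Eli", "age": 57, "os": "iOS"},
]

# Mask the direct identifier and generalize the age
released = [
    {"name": "***", "age": bucket_age(u["age"]), "os": u["os"]}
    for u in users
]

k_before = smallest_group(users, ["age", "os"])
k_after = smallest_group(released, ["age", "os"])
print(k_before, k_after)  # 1 2
```

Before bucketing, every user is unique on age and operating system (k = 1). Afterwards, each user hides among at least one other (k = 2). What this number does not tell you is how much an attacker with outside knowledge can still infer, which is the gap differential privacy is meant to close.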
Think of privacy controls as a gear that can be tuned: on the left is pure randomness (no utility) and on the right is complete utility (no privacy measures enforced at all). Differential privacy, for example, would be skewed to the left of that gear but still provide a good enough mix of privacy and utility to build a meaningful model (in many but not all cases). This gear should be tweaked based on who you are in the organization and what you’re trying to do with the data – and it should not, and cannot, be black and white at all.
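With differential privacy, the tuning knob has a name: epsilon. As a rough sketch (my own illustration using only Python's standard library, not any particular product's implementation), here is the classic Laplace mechanism applied to a simple count. The helper names and the example count of 1000 are assumptions for the demo.

```python
import random

def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise; a count has sensitivity 1."""
    scale = 1.0 / epsilon
    # The difference of two exponential draws is Laplace-distributed
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

def mean_abs_error(true_count, epsilon, trials=2000, seed=0):
    """Average distance between the noisy answer and the truth."""
    rng = random.Random(seed)
    errors = [
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials

# Small epsilon: strong privacy, noisy answers (toward the "left")
strong_privacy_error = mean_abs_error(1000, epsilon=0.1)
# Large epsilon: weak privacy, accurate answers (toward the "right")
weak_privacy_error = mean_abs_error(1000, epsilon=10.0)
print(strong_privacy_error > weak_privacy_error)  # True
```

Turning epsilon down adds more noise and more privacy; turning it up does the opposite. The expected error is roughly 1/epsilon, so the trade-off is something you can state as a number rather than a feeling.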
Privacy techniques have been around for a long time, as have the regulations that have driven them. HIPAA sets pseudonymization standards so you can transfer health records and get appropriate care without giving up privacy (though, as you now know, pseudonymization comes with no guarantees of privacy). GDPR is another regulation that forces companies to manage data with a privacy-first mentality – what the EU regulators term “Privacy by Design.” These controls are complex to understand and implement, especially in large dynamic corporations with many algorithms being developed and new data arriving constantly.
But effective privacy techniques, when implemented appropriately, allow data subjects to reap huge benefits – in this case, having your keyboard predict what you’re going to type without giving up your privacy to random humans at some “Super Keyboard.” Trust is the future of consumer brand relationships. (Side note: Google’s keyboard has many privacy measures, even beyond what was discussed here.)
Now, back to homomorphic encryption.
The shortcoming of homomorphic encryption as a privacy technique is that it’s the same black and white / all-or-nothing paradigm as authentication, firewalls, and encryption – it’s just another security measure against data breaches. Of course, without security, there is no privacy. So homomorphic encryption enhances privacy, but it’s not a privacy technique in and of itself. Differential privacy, on the other hand, is a privacy technique, but, when left alone without security measures, it also doesn’t get you to the finish line. A great example of combining a privacy technique with homomorphic encryption is found in Google’s Secure Aggregation paper.
If privacy is your goal, you need to carefully consider the differences between privacy techniques and security techniques. Organizations struggle with data breaches and consumer trust; had they incorporated privacy techniques alongside their security techniques, their breaches would have been less devastating, or might not have been breaches at all.
Special thanks to Alex Ingerman, product manager at Google AI, for his personal (not associated with Google) feedback on this blog.
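To give a flavor of what aggregating without seeing individual values can look like, here is a heavily simplified Python sketch of additive masking, where random masks shared between pairs of users cancel out in the sum. This is my own illustration and not the protocol from that paper: it assumes every pair of users has somehow already agreed on a shared secret, it ignores users dropping out, and the names and modulus are arbitrary.

```python
import secrets

MOD = 2**32  # all arithmetic is done modulo this (arbitrary choice)

def mask_updates(updates):
    """Hide each user's value behind pairwise masks that cancel in the sum."""
    n = len(updates)
    masked = [u % MOD for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            # Stand-in for a secret that users i and j agreed on privately
            s = secrets.randbelow(MOD)
            masked[i] = (masked[i] + s) % MOD  # user i adds the mask
            masked[j] = (masked[j] - s) % MOD  # user j subtracts it
    return masked

def aggregate(masked):
    """The server sums masked values; the masks cancel, leaving the true total."""
    return sum(masked) % MOD

updates = [3, 5, 9]  # hypothetical per-user values
masked = mask_updates(updates)
total = aggregate(masked)
print(total)  # 17
```

The server learns the total (17) but each individual masked value looks like random noise. Pair this with a privacy technique on the total itself, such as adding differentially private noise, and you get security and privacy working together rather than one standing in for the other.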