Data generalization is the process of creating a broader categorization of data in a database, essentially ‘zooming out’ from the data to create a more general picture of the trends or insights it provides. If you have a data set that includes the ages of a group of people, the data generalization process may look like this:
Ages: 26, 28, 31, 33, 37, 42, 42, 46, 48, 49, 54, 57, 57, 58, 59
Generalized ages: 20-29 (two people), 30-39 (three), 40-49 (five), 50-59 (five)
Data generalization replaces a specific data value with one that is less precise. That may seem counterproductive, but it is a widely used technique in data mining, analysis, and secure data storage.
When Do You Need Data Generalization?
One of the primary applications for data generalization is when you need to analyze data that you’ve collected, but also need to ensure the privacy of the individuals who are included in that data. It’s a powerful way of abstracting personal information while retaining the usefulness of the data points. In the age example above, generalizing age data based on each decade gives a general picture of where the individuals in the data set fall, and still allows you to use that data for relatively accurate targeting or analysis.
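As a concrete sketch, decade-based generalization like the age example above could look something like the following (the function name and data are illustrative, not from any particular library):

```python
# Replace each exact age with its decade range, e.g. 26 -> "20-29".
ages = [26, 28, 31, 33, 37, 42, 42, 46, 48, 49, 54, 57, 57, 58, 59]

def generalize_age(age: int) -> str:
    """Generalize an exact age to a decade-wide bin."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

generalized = [generalize_age(a) for a in ages]
print(generalized)
# The exact values are gone, but the overall distribution is preserved:
# ['20-29', '20-29', '30-39', '30-39', '30-39', '40-49', ...]
```

The exact ages can no longer be recovered, but the binned values still support analysis of how the group is distributed by decade.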
There are multiple ways to approach data generalization, ranging widely in effectiveness and retention of the data’s integrity. In cases where you have multiple identifying data points but only one or a few are relevant to your needs, you can often use more robust generalization techniques on the irrelevant data points while leaving the relevant points more or less intact.
Another key point when it comes to data generalization is compliance — there are regulations in place that determine how much identifying information about individuals can be left unchanged. Make sure you’re aware of your specific industry’s regulatory requirements to avoid potential data leaks or unauthorized exposure.
Main Data Generalization Types
There are two primary types of data generalization, and which you use in a given instance depends on a range of factors — the type of data, your specific needs and goals for that data, and the privacy and security requirements put in place by your organization, your industry, and/or government regulatory bodies.
The two main types of data generalization are automated generalization and declarative generalization. Let’s explore what each one means and how it looks in practice.
Automated generalization uses algorithms to determine the minimum amount of generalization or distortion required to ensure proper privacy while retaining accuracy. This predetermined value of generalization is usually referred to as k, and the process of k-anonymization is one of the most common generalization techniques.
If k=2, the data is said to be 2-anonymous. That means the data has been generalized enough that every combination of values appears at least twice. The list of ages mentioned earlier is 2-anonymous because there are at least two instances of each ‘category’ of data, in this case age ranges.
If a set of data featured the location and age for a series of individuals, then the data would need to be generalized to the point that each age/location pair in the data appeared at least twice.
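Checking whether a generalized data set meets a given k comes down to a simple frequency count. A minimal sketch, using hypothetical (age range, region) records:

```python
from collections import Counter

def is_k_anonymous(records, k):
    """True if every combination of values appears at least k times."""
    counts = Counter(tuple(record) for record in records)
    return all(count >= k for count in counts.values())

# Hypothetical generalized (age range, region) pairs
records = [
    ("20-29", "Northeast"), ("20-29", "Northeast"),
    ("40-49", "South"), ("40-49", "South"), ("40-49", "South"),
]
print(is_k_anonymous(records, 2))  # True: every pair appears at least twice
print(is_k_anonymous(records, 3))  # False: ("20-29", "Northeast") appears only twice
```

Real anonymization tooling also has to decide *how* to generalize until the check passes, but the check itself is this straightforward.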
Declarative generalization involves manually deciding how large your data bin sizes will be in any given scenario. In our age group example, we determined that the bin size would be a decade. If this were a real-world data set, our reasoning may have been that we determined that bin size provided the highest level of privacy and security for each individual in the dataset without compromising the usefulness of the data.
There are some inherent limitations to declarative generalization, most notably the fact that it can sometimes distort or bias the data due to outliers often being excluded entirely. That said, declarative generalization can be a good starting point when sharing secure data to ensure that whoever is receiving that data doesn’t have any more specificity than is required to achieve the desired result.
Data Generalization Identifiers
Identifiers are data points about an individual that could be used to determine their identity and/or linked to other information about that individual. There are two main types of identifiers — direct identifiers and quasi identifiers.
The distinction between the two matters, because the way you deal with both direct identifiers and quasi identifiers will determine whether your data is truly anonymous or just gives the illusion of being anonymous. Even major, established enterprises have made news when they’ve publicly released information they thought was adequately generalized, only to have third parties re-identify individuals in the data set.
Here’s what you need to know about direct and quasi identifiers.
Direct identifiers are data points that allow an individual to be identified and also allow other data to be linked to that individual. A data point can be a direct identifier even if multiples of that same data point exist in data. For example, even if there are two individuals with the name “John,” the name is still considered a direct identifier.
Quasi identifiers do not allow you to identify an individual on their own, but could be combined with other data to do so. Quasi identifiers can be unique within a data set, but they are also either already present in other unique data sets or are likely to show up in other data sets in the near future.
Let’s say your data set contains an individual’s gender and zip code. Obviously, there will be enough other people who are that gender and live in that zip code that this individual can’t be identified from those two data points alone. But if that person also appears in another data set that includes their gender, zip code, and even more information about them, between the two data sets, someone may be able to link the data and identify the individual.
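A linkage of this kind can be sketched in a few lines. The names and records below are entirely made up for illustration:

```python
# An "anonymized" release that kept gender and zip code as quasi identifiers
released = [("F", "02139", "diagnosis: diabetes")]

# A separate public data set containing the same quasi identifiers plus names
public = [("Jane Doe", "F", "02139"), ("John Roe", "M", "02144")]

# Joining the two on (gender, zip) can re-identify individuals
for gender, zip_code, attribute in released:
    for name, g, z in public:
        if (gender, zip_code) == (g, z):
            print(f"{name} may be linked to '{attribute}'")
```

Neither data set identifies anyone on its own; it is the join on shared quasi identifiers that breaks the anonymity.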
When it comes to de-identifying information, or removing enough direct identifiers and quasi identifiers that the individuals in question cannot be identified, there are two primary techniques — generalization and randomization.
Once direct identifiers in data have been masked, data can be generalized via k-anonymization. The key to effective generalization is having a streamlined process that addresses all privacy and legal concerns, and Immuta enables data teams to consistently apply k-anonymization across any database to seamlessly prepare data for use.
Randomization offers a mathematical guarantee that sensitive information and attributes cannot be used in inference attacks. The attributes are randomized so that the amount of personal information available in the data set is limited and individuals can’t be identified, but the data’s integrity and utility remain intact.
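One classic randomization technique (not named in this article, but a standard illustration of the idea) is randomized response: each individual’s answer is perturbed by a coin flip, so no single record can be trusted, yet the aggregate rate can still be estimated. A minimal sketch:

```python
import random

def randomized_response(truth: bool) -> bool:
    """Report the true answer with probability 1/2;
    otherwise report a fresh coin flip."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_true_rate(responses):
    """Invert the noise: P(reported yes) = 0.5*p + 0.25, so p = 2*(rate - 0.25)."""
    rate = sum(responses) / len(responses)
    return 2 * (rate - 0.25)

# Simulate 10,000 respondents where the true "yes" rate is 30%
random.seed(0)
truths = [random.random() < 0.30 for _ in range(10_000)]
responses = [randomized_response(t) for t in truths]
print(round(estimate_true_rate(responses), 2))  # close to 0.30
```

No individual response reveals the truth with certainty, which is the kind of mathematical guarantee randomization provides, while the population-level estimate stays useful.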
Make Generalization and Data Security Simple
Keeping data secure through techniques like generalization does not have to dominate your time or resources. By automating your data access controls and security processes with Immuta, you can achieve data security and integrity while staying compliant with all legal requirements — so you can focus on growing your business with data.
Request a demo today to learn how Immuta can transform the way you handle your data — even the most sensitive data.