The Ultimate Manual To Anonymization
With the growing need for companies to safely store, share, and manage sensitive personal data, many organizations turn to data anonymization for additional security. This is especially true for businesses that must comply with strict data privacy laws, like GDPR in the European Union. The in-depth guide below will explain everything you need to know about data anonymization and how it works.
What is Anonymization Anyway?
Anonymization is a data privacy technique that protects sensitive information associated with an individual. The process erases or encrypts any personal identifiers that could link data to a natural person.
Examples of personal identifiers include names, phone numbers, addresses, social security numbers, and more. Anonymization protects the statistical accuracy and integrity of the data while keeping the data sources anonymous.
In an era where data collection and analysis are crucial for business operations, data security methods like anonymization are more critical than ever before.
Anonymization allows businesses to analyze and share data associated with customers and users without revealing the identities of the people behind it.
Beyond protecting sensitive customer data, anonymization can also help with compliance, since some jurisdictions require organizations to implement data security methods for personal information. Under the GDPR (General Data Protection Regulation), the most serious violations can cost companies up to €20 million or 4% of their total worldwide annual turnover for the preceding financial year, whichever is higher.
How Anonymization Works
In terms of data security, anonymization is one of the strictest ways to protect sensitive information. Unlike other types of data protection methods, anonymization can be irreversible.
Depending on the process used, certain parts of a data set containing sensitive personal information can be encrypted, and then the encryption key gets deleted.
In many cases, anonymization is a one-way process. However, hackers and malicious attackers could potentially cross-reference anonymous data with other sources available to the public in an effort to expose personal information. This process is known as de-anonymization.
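To make the cross-referencing risk concrete, here is a minimal Python sketch. The records, the field names (`zip`, `birth_date`), and the `link_records` helper are all hypothetical and exist only for illustration. It shows how a data set with names removed can still be matched against a public list when both share a few descriptive fields.

```python
# Hypothetical "anonymized" records: names removed, but ZIP code and
# birth date were left in place.
anonymized = [
    {"zip": "02139", "birth_date": "1984-03-07", "purchase": "allergy medication"},
    {"zip": "60614", "birth_date": "1990-11-21", "purchase": "running shoes"},
]

# Hypothetical public list, for example a membership directory.
public = [
    {"name": "Alex Doe", "zip": "02139", "birth_date": "1984-03-07"},
    {"name": "Sam Roe", "zip": "94110", "birth_date": "1975-06-30"},
]


def link_records(anon_rows, public_rows, keys):
    """Return (name, purchase) pairs where the key fields match exactly one public row."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = []
    for row in anon_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], row["purchase"]))
    return matches


reidentified = link_records(anonymized, public, keys=("zip", "birth_date"))
print(reidentified)
```

In this toy data, the first record is linked back to a name even though the name was never stored with it. Removing or coarsening the shared fields is what breaks the link.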
Data anonymization is similar to data pseudonymization, although the two terms are not interchangeable.
The GDPR addresses both techniques: it names pseudonymization as a safeguard for personal information, and it treats truly anonymous data as outside its scope. The key difference is that pseudonymization is reversible.
For example, let’s say a basic data set contains a person’s name, phone number, and home address along with their transactional history at a particular business. With anonymization, the personal identifiers (name, number, address) will be permanently erased or encrypted. But with pseudonymization, the identifiers will be replaced with pseudonyms to conceal the customer’s real identity.
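The difference can be sketched in a few lines of Python. This is an illustration under assumptions, not a production design: the record, the `IDENTIFIERS` tuple, and the `anonymize` and `pseudonymize` helpers are hypothetical names chosen for this example.

```python
import secrets

# Hypothetical customer record with identifiers and transactional history.
record = {
    "name": "Alex Doe",
    "phone": "555-0100",
    "address": "1 Main St",
    "purchases": [12.50, 8.75],
}
IDENTIFIERS = ("name", "phone", "address")


def anonymize(rec):
    """Drop the identifiers entirely; nothing is kept that could restore them."""
    return {k: v for k, v in rec.items() if k not in IDENTIFIERS}


def pseudonymize(rec, lookup):
    """Replace the identifiers with a random token and remember the mapping."""
    token = secrets.token_hex(8)
    lookup[token] = {k: rec[k] for k in IDENTIFIERS}
    out = {k: v for k, v in rec.items() if k not in IDENTIFIERS}
    out["customer_token"] = token
    return out


lookup_table = {}  # would need to be stored separately and protected
anon = anonymize(record)
pseudo = pseudonymize(record, lookup_table)

# Whoever holds the lookup table can reverse the pseudonymization.
restored = {**lookup_table[pseudo["customer_token"]], "purchases": pseudo["purchases"]}
```

The anonymized record keeps only the purchase history, while the pseudonymized record can be restored in full by anyone who holds `lookup_table`. That lookup table is the reason pseudonymized data is easier to re-identify.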
It’s easier to re-identify data sources with pseudonymization. So anonymization should be used for instances where a complete dissociation needs to occur between a person’s identity and their data.
Example 1: Health Insurance Underwriting and Medical Reporting
The Health Insurance Portability and Accountability Act (HIPAA) is a federal law that creates national standards to protect sensitive patient health data.
HIPAA protects individually identifiable health information, meaning health data that can be linked to a specific natural person.
But health insurance companies still need to create different plans for group or individual health coverage. To create these plans effectively, they need to analyze various health data to assess risk.
The data analytics must be completed without disclosing any patient records that would identify the individual. Things like names and social security numbers would need to be scrubbed from the data sets until rendered anonymous.
Similarly, medical facilities might need to report certain information to research groups for medical testing. Maybe a research lab wants to know if people have any allergic reactions to a specific prescription or vaccination.
The medical facility collecting this data could tell the research facility whether a patient had a reaction and what the reaction was. But everyone’s personal identifiers would need to be eliminated from the report.
Example 2: Business Intelligence Reporting Dashboards
Organizations worldwide collect data on their customers for a broad range of potential use cases. It’s common for this information to be used in reports so company executives can make data-driven decisions.
Businesses can take raw data and anonymize sensitive personal information before generating reports.
Rather than analyzing data individually, a company could create reports using generalized information as a whole. For example, the age of customers can be segmented into groups (18-25, 26-35, etc.) and plotted on a graph.
Average transactional values can be analyzed as well, without using the personal information associated with each individual buyer.
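As a rough sketch of this kind of reporting, the Python below groups customers into age bands and computes an average order value per band. The sample rows, the band boundaries in `AGE_BANDS`, and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical customer rows with direct identifiers already removed.
customers = [
    {"age": 19, "order_total": 20.0},
    {"age": 24, "order_total": 30.0},
    {"age": 31, "order_total": 45.0},
    {"age": 35, "order_total": 55.0},
    {"age": 52, "order_total": 80.0},
]

# Inclusive (low, high) bounds; the band edges are an assumption for this example.
AGE_BANDS = [(18, 25), (26, 35), (36, 45), (46, 55)]


def age_band(age):
    """Map an exact age to its band label, or 'other' if no band fits."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"


def summarize(rows):
    """Return customer count and average order value per age band."""
    totals = defaultdict(list)
    for row in rows:
        totals[age_band(row["age"])].append(row["order_total"])
    return {
        band: {"customers": len(values), "avg_order": sum(values) / len(values)}
        for band, values in totals.items()
    }


report = summarize(customers)
print(report)
```

The report contains only band labels, counts, and averages. Bands that contain very few customers can still point to an individual, so a real report might merge or suppress them.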
Executives, managers, and other decision-makers company-wide can still gain crucial insights into what’s happening with their customers without sharing any sensitive or personally identifiable data. Instead, everything can be evaluated through more of a big-picture lens while remaining compliant.
How to Get Started With Anonymization
As you can see from the examples above, there are many potential use cases where anonymization can help protect sensitive personal data. To get started with applying anonymization to your business, follow the simple steps explained below:
Step 1: Determine if Anonymization is Right For Your Situation
The first thing you need to do is determine whether or not your situation calls for anonymization. In some circumstances, pseudonymization might be a better alternative.
Anonymization should be applied in situations where you want to completely lose the connection between the data and an individual. It’s usually used in statistical or research-related scenarios.
Once data has been truly anonymized, it falls outside the scope of the GDPR. That’s because the data no longer contains any information that relates to an identifiable person.
While data anonymization is meant to permanently sever the link between a person and their data, there may still be indirect ways to re-identify someone.
For example, let’s say you were analyzing data from a local coffee shop. Through the process of anonymization, the customers’ names and phone numbers associated with their loyalty accounts have been permanently removed from the data—all that remains is the transactional information.
But an analyst could still see patterns in that data. If every day at precisely 10:15 AM a customer buys an iced double espresso with soy milk and a croissant, you would be able to identify this person from their buying behavior.
Organizations that want to truly anonymize sensitive data should be careful to hide any pieces of information that would allow for re-identification.
Using that same example, you may ultimately decide to eliminate the individual transactional information from your data. Instead, you could look at the total number of items ordered and the corresponding value of those transactions on an hourly basis.
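Here is a minimal Python sketch of that hourly roll-up. The transaction rows and field names are hypothetical, and the timestamps are assumed to be ISO 8601 strings.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical coffee shop transactions with customer identifiers already removed.
transactions = [
    {"timestamp": "2024-05-01T10:15:00", "items": 2, "total": 7.50},
    {"timestamp": "2024-05-01T10:42:00", "items": 1, "total": 3.25},
    {"timestamp": "2024-05-01T11:05:00", "items": 3, "total": 11.00},
]


def hourly_totals(rows):
    """Sum item counts and order values per clock hour."""
    buckets = defaultdict(lambda: {"items": 0, "total": 0.0})
    for row in rows:
        hour = datetime.fromisoformat(row["timestamp"]).strftime("%Y-%m-%d %H:00")
        buckets[hour]["items"] += row["items"]
        buckets[hour]["total"] += row["total"]
    return dict(buckets)


summary = hourly_totals(transactions)
print(summary)
```

Once only the hourly sums are kept, the 10:15 AM order is no longer visible as its own row, so the daily buying pattern is much harder to spot.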
It’s more challenging to use anonymization when analyzing individual customer or user data because you’ll need to eliminate additional identifiers.
Under EU privacy rules such as the GDPR, websites generally need consent from visitors before collecting personal information like cookie data, device IDs, and IP addresses. Collecting this type of data anonymously limits the value you can get from it for personalized marketing efforts.
For data associated with marketing and user behavior to aid with personalized campaigns, pseudonymization might be a better alternative to anonymization.
Step 2: Identify Data That Needs to Go Through Anonymization
After you’ve confirmed that anonymization is right for your situation, it’s time to narrow down the data that must be made anonymous.
In terms of the GDPR, “personal data” is intentionally used as a broad term. As we’ve seen with previous examples, certain data types might not seem personal at first glance. But when paired with additional information, some information could be used to re-identify an individual meant to be anonymous.
Generally speaking, the following would all constitute personal data that would need to be anonymized:
- Names
- Addresses
- Phone numbers
- Account numbers and ID numbers
- IP addresses
- Cookie data
- Web locations
- RFID tags
- Biometrics data
- Race and ethnicity
- Sexual orientation
- Political affiliation
- Vehicle identification numbers (VINs)
- License plates
- Social security numbers
Under GDPR law, no consent is required to collect anonymized data because the data no longer contains any personal information.
Here’s an example that explains how strict the GDPR is when it comes to data anonymization. A taxi company in Copenhagen, TAXA 4×35, was found to be in violation of the GDPR.
The company believed it was compliant because it anonymized the names of users in its database. However, TAXA 4×35 was not anonymizing the riders’ pickup and drop-off addresses.
This information could be attributed to a natural person, so the company was fined roughly €160,000.
Make sure you have a firm grasp of the data you’re collecting. What might not initially seem like something that should be anonymized could put you in violation of regulatory laws in certain jurisdictions.
Step 3: Choose Your Anonymization Technique
There are several different ways to achieve data anonymization. I’ll explain some of the top anonymization methodologies in greater detail below so you can determine the best option for your business.
- Data Masking — Data masking modifies the values of a data set. It can be accomplished through character shuffling, character substitution, or encryption. For example, someone’s name in a data set could be replaced with an “X” or a “0,” making it difficult to identify or reverse-engineer the individual. Data masking is commonly used for billing information, where credit card information gets listed as XXXX XXXX XXXX 8972 on file.
- Data Scrambling or Shuffling — This anonymization technique involves mixing the letters or digits of any data deemed to be personally identifiable. For example, an account number like #97531 could become #39517. This works best for long strings of numbers where the possible combinations would be tough to figure out.
- Generalization — The generalization method deliberately makes data less precise, so it’s harder to link to one person. The goal here is to remove personal identifiers while keeping the information accurate at a coarser level. For example, you could delete the home addresses in your database and replace them with regions, such as Northeast or Southwest. Or you can replace a birthday with an age range, like 18-25, 26-35, 36-45, etc.
- Data Perturbation — This technique modifies the original data by adding random noise or rounding values. As long as the changes are small relative to the natural spread of the data, the results can still be valuable. But if the changes are too large, they could nullify the ability to use the data for anything meaningful.
- Synthetic Data — You can use algorithms to manufacture data that doesn’t connect to real events. Companies use synthetic information to create artificial databases by altering the original data. Applying medians, linear regressions, standard deviations, and other statistical models can generate synthetic data worth analyzing while protecting the original data source.
- Data Blurring — Data blurring makes it harder to identify an individual with certainty by using approximated values, similar to the generalization technique. For example, you’d be able to identify a natural person by their account balance at any given point in time. But by adding a small random value (like $1.26) to this balance, you can make the person anonymous without adding a significant amount of error to the data.
- Data Encryption — Encryption translates all of the personal identifiers in the data into an unreadable format. Only an authorized user with the encryption key or password can change the data back to its original form. In some instances, the encryption keys are destroyed, and the anonymization process is permanently irreversible.
- Null Data — In this scenario, all sensitive data is immediately deleted from the data set. Any piece of sensitive information, like a name, address, or phone number, will be displayed as null values in the data set.
- Data Swapping — The data swapping technique rearranges the attribute values of a data set so they no longer line up with their original records. For example, you can shuffle the values in a specific column across rows. Doing this with something like a date of birth can help make data anonymous.
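Several of the techniques in the list above can be sketched as short Python functions. These are simplified illustrations, not hardened implementations. The function names, the sample card number, and the ten-year age width are all assumptions made for the example.

```python
import random


def mask_card(number, visible=4):
    """Data masking: replace all but the last `visible` digits with X."""
    digits = number.replace(" ", "")
    masked = "X" * (len(digits) - visible) + digits[-visible:]
    # Regroup into blocks of four characters for display.
    return " ".join(masked[i:i + 4] for i in range(0, len(masked), 4))


def shuffle_digits(value, rng):
    """Data shuffling: return the same characters in a random order."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a fixed-width range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def null_fields(record, fields):
    """Null data: set the listed fields to None and keep everything else."""
    return {k: (None if k in fields else v) for k, v in record.items()}


rng = random.Random(42)  # seeded only so the example is repeatable
print(mask_card("4539 1488 0343 8972"))
print(shuffle_digits("97531", rng))
print(generalize_age(34))
print(null_fields({"name": "Alex Doe", "city": "Boston"}, {"name"}))
```

Masking and nulling remove information outright, while shuffling and generalization keep part of it. Which trade-off is acceptable depends on how the data will be used afterward.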
As you can see, there are lots of ways to apply data anonymization. Depending on your scenario, you might ultimately use more than one of these techniques to make your data anonymous and protect sensitive personal information.
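As a rough illustration of combining techniques, the Python sketch below removes names, blurs account balances with small random noise, and swaps birth years across rows. The account data, the noise scale of 2.0, and the helper names are hypothetical and chosen only for this example.

```python
import random


def drop_fields(rows, fields):
    """Remove the listed fields from every row."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


def blur(rows, column, scale, rng):
    """Blurring/perturbation: add noise in [-scale, scale] to one numeric column."""
    return [
        {**row, column: round(row[column] + rng.uniform(-scale, scale), 2)}
        for row in rows
    ]


def swap_column(rows, column, rng):
    """Data swapping: shuffle one column's values across rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


# Hypothetical account records.
accounts = [
    {"name": "Alex Doe", "birth_year": 1984, "balance": 1520.00},
    {"name": "Sam Roe", "birth_year": 1975, "balance": 310.40},
    {"name": "Kim Poe", "birth_year": 1990, "balance": 9875.10},
]

rng = random.Random(7)  # seeded only so the example is repeatable
released = drop_fields(accounts, {"name"})
released = blur(released, "balance", 2.0, rng)
released = swap_column(released, "birth_year", rng)
print(released)
```

Because the noise is centered on zero, totals and averages shift only slightly. Swapping keeps the overall set of birth years intact but breaks the link between each birth year and the rest of its row.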