The Ultimate Manual To Pseudonymization

Data security is a hot topic in the digital era. If you’re trying to comply with the European Union’s GDPR (General Data Protection Regulation) when storing personal data, pseudonymization is one of the safeguards the regulation itself points to. This guide explains what pseudonymization is and how it works.

What is Pseudonymization Anyway?

In simple terms, pseudonymization is a data security technique that replaces identifying data with artificial stand-ins. It’s based on the word “pseudonym,” meaning a false name or alias that obscures a person’s true identity, and it’s designed so that data associated with an individual can’t be traced back to an identifiable person without additional information that is kept separately.

For research, testing, analytics, data warehousing, and other business processes, the data still maintains its statistical accuracy. Only the identity of whose data it is gets obscured. When pseudonymization techniques are implemented correctly, they reduce the chances of associating an actual identity with a data subject.

The GDPR explicitly names pseudonymization as an appropriate safeguard (for example, in Article 32 on the security of processing), although it doesn’t make the technique mandatory.

How Pseudonymization Works

Pseudonymization replaces all personal identifiers in a data set with some type of pseudonym. This process helps conceal the identity of a natural person or people in the data set.

The method allows you to switch original data, like a person’s name, with a pseudonym. So you could take someone’s name on a data set and replace it with an alias (John Smith becomes Jackie Johnson). But in most cases, pseudonyms are randomly generated alphanumeric replacements (John Smith becomes A1C329SG).
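The replacement step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names, the 8-character length, and the in-memory lookup dictionary are all assumptions made for the example.

```python
import secrets
import string

# Characters used for pseudonyms such as "A1C329SG" (an assumption for this sketch).
ALPHABET = string.ascii_uppercase + string.digits


def new_pseudonym(length=8):
    """Return a random alphanumeric code of the given length."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def pseudonymize(name, lookup):
    """Return the pseudonym for `name`, creating one on first use.

    `lookup` maps real names to pseudonyms, so the same person always
    gets the same code and the data set stays internally consistent.
    """
    if name not in lookup:
        code = new_pseudonym()
        while code in lookup.values():  # regenerate on the rare collision
            code = new_pseudonym()
        lookup[name] = code
    return lookup[name]


lookup = {}
first = pseudonymize("John Smith", lookup)
second = pseudonymize("John Smith", lookup)  # same code as `first`
other = pseudonymize("Jane Doe", lookup)     # a different code
```

Whoever holds the lookup dictionary can map codes back to names, which is why it has to be protected as carefully as the original data.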

It’s important to understand that pseudonymized data is still considered personal data under the GDPR. That’s because the process of pseudonymization is reversible.

Pseudonymization lowers the impact of a data breach and helps protect personal information if someone steals the data. But with the key or lookup table that links pseudonyms back to identities, hackers and scammers can reverse the process and potentially identify the people in the data set.

Here’s what the GDPR considers pseudonymous data:

  • Identifiers have been altered in a way so that it’s impossible to recognize an individual without using additional information.
  • All additional information must be kept separately and must be protected using both technical and organizational measures to ensure personal data isn’t linked to a natural person.

It’s worth noting that pseudonymization is not anonymous, although the terms pseudonymous and anonymous are often confused with each other or used interchangeably.

With anonymous data, the personal data associated with an identifiable person is permanently de-linked. For example, a person’s name could be encrypted and the encryption key destroyed, leaving no way for anyone to recover the name associated with that data. With pseudonymous data, by contrast, the key isn’t destroyed. Instead, it’s stored securely and kept out of reach of anyone unauthorized.

Another reason why pseudonymized data isn’t anonymous is that it could be traced back to a person using “additional information.” Truly anonymous data can’t be re-identified, even with the help of outside information.

Let’s say a company conducts internal peer-to-peer reviews once a year. But the names associated with each review are replaced by a number, so each employee can read their own feedback without knowing which co-worker wrote it. However, if the feedback contains certain unique phrasing only used by one person, the reader could determine who wrote the review.

The good news is that pseudonymous data is the next best thing to anonymous data. When it’s implemented correctly, pseudonymization substantially reduces the risk to individuals despite not being fully anonymous.

Example #1: Tracking Customer Spending Data

It’s common for businesses to track different data metrics and KPIs associated with customer spending. Whether transactions occur online, in-person, or over the phone, different POS systems make it easy for companies to collect this type of data.

However, customer spending data can contain sensitive information. It’s not necessary for data analysts to access this information to do their jobs properly.

Let’s take a look at how pseudonymization can be applied in this example. Here’s a sample data set that could be collected:

Name: Steve Garcia
Account Number: 12345
Email: steve@gmail.com
Transaction Date: 05/10/2021
Transaction Amount: $137.50

Now let’s see that exact same data after it’s been pseudonymized:

Name: AB492
Account Number: XFS23
Email: jsktys@gmail.com
Transaction Date: 05/10/2021
Transaction Amount: $137.50

As you can see, the data still maintains its statistical integrity after pseudonymization occurs.

The person’s name and account number have both been replaced using randomly generated alphanumeric codes. The email has also been changed but remains realistic. This makes it easy for an analyst to know they’re looking at an email address without referring to the column heading, which is useful for large data sets.

The transaction dates and amounts both remain intact, as this part of the data doesn’t identify the person.
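As a rough sketch, the transformation in this example could look like the Python below. The field names come from the sample record; the helper names, code lengths, and the choice to keep the email domain are assumptions for illustration only.

```python
import secrets
import string

# Fields treated as identifiers in this example; everything else passes through.
IDENTIFIERS = ("Name", "Account Number", "Email")


def random_code(length=5, alphabet=string.ascii_uppercase + string.digits):
    """Return a random code such as 'AB492'."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


def fake_email(address):
    """Replace the part before '@' with random letters and keep the domain,
    so the value still looks like an email address."""
    _, _, domain = address.partition("@")
    return random_code(6, string.ascii_lowercase) + "@" + domain


def pseudonymize_record(record, key_table):
    """Return a copy of `record` with identifier fields replaced.

    `key_table` remembers each replacement so repeat customers map to the
    same pseudonyms. It must be stored apart from the output.
    """
    out = dict(record)
    for field in IDENTIFIERS:
        original = record[field]
        if (field, original) not in key_table:
            if field == "Email":
                key_table[(field, original)] = fake_email(original)
            else:
                key_table[(field, original)] = random_code()
        out[field] = key_table[(field, original)]
    return out


record = {
    "Name": "Steve Garcia",
    "Account Number": "12345",
    "Email": "steve@gmail.com",
    "Transaction Date": "05/10/2021",
    "Transaction Amount": "$137.50",
}
key_table = {}
pseudo = pseudonymize_record(record, key_table)
```

The date and amount are copied unchanged, so totals and averages computed on the pseudonymized records match those computed on the originals.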

As previously mentioned, this data is not completely anonymous because it can be switched back with an encryption key. Furthermore, if the employee working the register on that day remembers this transaction amount because it was unusually high, they could potentially identify the customer with the data—even after the pseudonymization occurs.

Example #2: User Behavior For Targeted Advertising

Let’s say a company has a mobile app. In order to use the app, users must sign up using their name, email address, and phone number.

The app automatically tracks usage behavior and links the data to each account. Then marketers can take this data to segment users and target them based on behavior. For example, if a user hasn’t opened the app in a week, maybe you’ll send them a push notification with an offer or incentive to open.

Or maybe you want to personalize each user’s home page with recommended products based on browsing behavior—there are lots of possibilities here.

You can still accomplish your goals and protect sensitive user data at the same time using pseudonymization. Here’s a sample data set that we can evaluate:

User: Julie Smith
Email: jsmith@gmail.com
Phone: (206) 555-5555
Last Open Date: 02/03/2021
Average Session Duration: 1:46
Average Screens Per Session: 3

Pseudonymization can remove the identifiers associated with this user so the data can be analyzed for marketing purposes. Here’s what it looks like once the sensitive data has been replaced:

User: Suzy Williams
Email: abcdefg@gmail.com
Phone: (206) XXX-XXXX
Last Open Date: 02/03/2021
Average Session Duration: 1:46
Average Screens Per Session: 3

The marketing team can still use this information to send the user targeted messages based on engagement, as the last open date, average session duration, and average screens per session data remain the same.
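To illustrate, here is a small Python sketch of the “hasn’t opened the app in a week” segment computed on pseudonymized rows only. The user codes, the second row, and the function name are made up for the example, and the sample date 02/03/2021 is assumed to mean February 3, 2021.

```python
from datetime import date, timedelta

# Pseudonymized usage rows: no names, emails, or phone numbers.
# The user codes and the second row are invented for this sketch.
usage = [
    {"user": "U-7F3K", "last_open": date(2021, 2, 3), "avg_screens": 3},
    {"user": "U-Q9ZD", "last_open": date(2021, 2, 12), "avg_screens": 7},
]


def inactive_users(rows, today, days=7):
    """Return the pseudonyms of users whose last open is more than `days` ago."""
    cutoff = today - timedelta(days=days)
    return [row["user"] for row in rows if row["last_open"] < cutoff]


to_nudge = inactive_users(usage, today=date(2021, 2, 14))
# to_nudge == ["U-7F3K"]
```

The analysis never touches a real identity. Only the system that actually sends the push notification would need to resolve a pseudonym back to an account.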

How to Get Started With Pseudonymization

As you can see, pseudonymization can be a viable option for protecting user data and taking steps toward GDPR compliance. To get started with applying pseudonymization to your specific use cases, follow the steps below:

Step 1: Determine What Type of Data Needs to be Pseudonymized

Before you can do anything, you need to have a firm grasp on the type of data containing personal identifiers that can be traced to a specific person.

What constitutes “personal data” under the GDPR is very broad. Examples include:

  • Names
  • Phone numbers
  • Credit card numbers
  • Personal identification numbers
  • Account data
  • License numbers
  • License plates
  • Addresses
  • IP addresses
  • Device ID numbers
  • Cookies
  • Location data
  • Passport numbers
  • Social security numbers

The term “personal data” is meant to be as broad as possible, so that businesses and anyone else collecting data err on the side of caution about what could be linked back to a natural person.

For example, the GDPR doesn’t go into detail saying that “job title” must be protected. But depending on the use case, that information might need to be pseudonymized.

Let’s say you’re collecting data about employees in the workplace and sending the information to a third-party firm for an evaluation.

If you have a huge company with 5,000 sales reps, you may not think it’s necessary to replace that job title with a pseudonym. But if the Chief Financial Officer is mixed into that data set, it would be fairly easy to identify the person associated with that data. In this scenario, all job title data would need to be pseudonymized.

So take a look at the data you’re collecting and what you’re planning to do with it. Even if you’re just storing data, you need to pinpoint which identifiers can be traced to a person—and that’s what you need to focus on for pseudonymization.
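One simple way to spot fields like the job title in the scenario above is to count how many records share each value. The Python sketch below does that; the function name and the threshold of five are arbitrary assumptions, not a GDPR rule.

```python
from collections import Counter


def rare_values(rows, field, min_group_size=5):
    """Return values of `field` shared by fewer than `min_group_size` rows.

    A value that only one or two people share can single them out,
    even when their names have been removed.
    """
    counts = Counter(row[field] for row in rows)
    return {value: n for value, n in counts.items() if n < min_group_size}


# Hypothetical data set: 5,000 sales reps and one CFO.
employees = [{"job_title": "Sales Rep"} for _ in range(5000)]
employees.append({"job_title": "Chief Financial Officer"})

risky = rare_values(employees, "job_title")
# risky == {"Chief Financial Officer": 1}
```

Any field that produces rare values in a check like this is a candidate for pseudonymization, even if it doesn’t look like a personal identifier at first glance.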

Step 2: Understand the Different Methods of Data Pseudonymization

Pseudonymization is a broad term. Any data management technique in which a de-identification process has been used to replace personal information fields with an artificial identifier would fall into this category.

There are several different ways to actually implement pseudonymization and apply it to your specific use case. Here’s a brief overview of the most popular options:

  • Data Scrambling — You can apply the scrambling technique by mixing up the characters in the data to obscure it. It works better in some scenarios than in others. For example, the name “Steve” could be scrambled and turned into “VTSEE.” Someone who notices this could potentially unscramble the pseudonym by hand, without knowing how it was scrambled. But this is much more difficult to do with long strings of numbers.
  • Data Encryption — Encryption takes pseudonymization one step further by making the data completely unintelligible. In this case, the name “Steve” could be encrypted and turned into an unreadable string of characters. Usually, this type of data is protected with an encryption key, and the process can only be reversed by whoever possesses the key.
  • Data Masking — With data masking, you can just hide the most important pieces of the data set using random data or other characters. For example, you could take an IP address like 216.58.216.164 and mask it with XXX.XX.XXX.X64.
  • Tokenization — Tokenization replaces sensitive data with substitutes known as tokens. The tokens don’t have any intrinsic value or meaning that can be exploited if they fall into the wrong hands. With tokenization, you could take data like bob@gmail.com and turn it into aZT4 cQ76 R#4+ bzS7.
  • Approximation — Approximation replaces personal data with a less specific value. For example, let’s say that a customer’s birthday is October 1, 1989. The approximated record could be stored as September 1 – December 31, 1989, or just as 1989. This removes the personal identifier but still gives analysts the ability to retrieve useful information.
  • Data Blurring — As the name implies, data blurring makes data values unreadable by obscuring them. For example, if your data contains a photo of the user associated with a specific profile, you can blur the user’s face, so it’s impossible to identify them.
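Several of the methods above are simple enough to sketch in Python. The functions below are illustrative assumptions rather than standard implementations: they show masking, approximation, scrambling, and tokenization on the same sample values used in the list.

```python
import random
import secrets
from datetime import date


def mask_ip(ip):
    """Masking: hide every digit except the last two.
    216.58.216.164 becomes XXX.XX.XXX.X64."""
    digits_seen = 0
    out = []
    for ch in reversed(ip):
        if ch.isdigit():
            digits_seen += 1
            out.append(ch if digits_seen <= 2 else "X")
        else:
            out.append(ch)  # keep the dots so the format stays recognizable
    return "".join(reversed(out))


def approximate_year(birthday):
    """Approximation: keep only the year of a date."""
    return str(birthday.year)


def scramble(text, rng):
    """Scrambling: shuffle the characters. Easy to undo for short values,
    and random.Random is not a cryptographic generator."""
    chars = list(text)
    rng.shuffle(chars)
    return "".join(chars)


def tokenize(value, vault):
    """Tokenization: swap the value for a random token and remember the
    pairing in `vault`, which must be stored securely elsewhere."""
    if value not in vault:
        vault[value] = secrets.token_hex(8)
    return vault[value]


masked = mask_ip("216.58.216.164")            # "XXX.XX.XXX.X64"
year = approximate_year(date(1989, 10, 1))    # "1989"
scrambled = scramble("STEVE", random.Random())
vault = {}
token = tokenize("bob@gmail.com", vault)
```

Masking and approximation discard information for good, while the token vault makes tokenization reversible for whoever can read it.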

The right pseudonymization method for you depends on what you’re using it for. In some cases, you might use a combination of these techniques to protect user data and remove sensitive information with personal identifiers from your data sets.

There are different software options and IT tools out there that can be used to implement these methods.

Step 3: Store Personal Information and Pseudonyms Separately

As previously mentioned, pseudonymization is reversible. So it’s possible to take the pseudonymized data and turn it back into its original form.

That’s why it’s so important to keep the information separate.

If a hacker, scammer, or malicious attacker is able to steal some of the information, it will be useless without the corresponding data.

For example, you could keep the data on different servers, hardware, or cloud accounts. In the event of a breach, not all of the personal information will be exposed.
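Here is a minimal Python sketch of that separation. The two in-memory containers stand in for separate servers or cloud accounts, and the function and field names are assumptions for the example.

```python
import secrets


def split_record(record, identifier_fields):
    """Split one record into (pseudonymized row, key-table row).

    The two halves share only a random `pid`, so neither store reveals
    who the data belongs to on its own.
    """
    pid = secrets.token_hex(8)
    identity = {f: record[f] for f in identifier_fields}
    analytics = {k: v for k, v in record.items() if k not in identifier_fields}
    analytics["pid"] = pid
    return analytics, {"pid": pid, **identity}


def reidentify(analytics_row, key_store):
    """Rebuild the full record. Only works with access to BOTH stores."""
    return {**key_store[analytics_row["pid"]], **analytics_row}


# Stand-ins for two physically separate systems.
analytics_store = []   # e.g. the data warehouse analysts can query
key_store = {}         # e.g. a locked-down server with restricted access

record = {"Name": "Steve Garcia", "Email": "steve@gmail.com",
          "Transaction Amount": "$137.50"}
row, key_row = split_record(record, identifier_fields=("Name", "Email"))
analytics_store.append(row)
key_store[key_row["pid"]] = key_row

restored = reidentify(analytics_store[0], key_store)
```

If only the analytics store leaks, the attacker sees transaction amounts attached to random IDs. If only the key store leaks, the attacker sees names with no activity attached.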

Step 4: Know Your Compliance Obligations

If you’re implementing pseudonymization for GDPR compliance, I strongly recommend that you review the Pseudonymization Techniques and Best Practices guide.

This resource was published by ENISA—the European Union Agency For Cybersecurity.

The document contains in-depth technical solutions that can support the implementation of pseudonymization. It also covers techniques for protecting pseudonymized data against brute-force attacks, guesswork, and dictionary searches.
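To see why dictionary searches matter, consider the Python sketch below. It is not taken from the ENISA document; it’s a simplified illustration, with made-up function names and candidate emails, of how a plain hash used as a pseudonym can be reversed by guessing, while a keyed hash (HMAC) resists the same attack as long as the key stays secret.

```python
import hashlib
import hmac
import secrets


def plain_hash(value):
    """Unkeyed SHA-256 pseudonym. Anyone can recompute it from a guess."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_pseudonym(value, key):
    """HMAC-SHA256 pseudonym. Recomputing it requires the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def dictionary_attack(pseudonym, candidates, hash_fn):
    """Hash every candidate and return the one that matches, if any."""
    for candidate in candidates:
        if hash_fn(candidate) == pseudonym:
            return candidate
    return None


key = secrets.token_bytes(32)  # must be stored separately from the data
candidates = ["alice@example.com", "bob@gmail.com", "carol@example.com"]

leaked_plain = plain_hash("bob@gmail.com")
leaked_keyed = keyed_pseudonym("bob@gmail.com", key)

# The attacker has the leaked pseudonyms and a list of likely emails,
# but not the key.
cracked_plain = dictionary_attack(leaked_plain, candidates, plain_hash)
cracked_keyed = dictionary_attack(leaked_keyed, candidates, plain_hash)
```

In this sketch the plain hash is reversed on the second guess, while the keyed pseudonym yields no match. The protection depends entirely on keeping the key secret.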

Compliance varies by industry and location. For example, HIPAA in the United States doesn’t use the term pseudonymization. Instead, it sets its own de-identification standards for sharing health data. And outside of Europe, general consumer data protection laws may not be as strict as the GDPR.

Depending on the scenario, pseudonymization alone might not be enough to remain GDPR compliant. It’s your responsibility to understand the compliance and regulations related to your specific business or industry.

For example, pseudonymization doesn’t release you from the obligation to obtain consent from users, or to establish another lawful basis, before you collect their data.