The Ultimate Manual For Data Masking

In the US, the total cost of a data breach averages $8.19 million, roughly twice the global average. This is just one reason why safeguarding confidential business and consumer information has become more important than ever.

Stringent data privacy legislation, like the GDPR in the EU and the CCPA in California, also pushes businesses to use private data as little as possible.

Data security is everyone’s responsibility today, and a growing number of organizations rely on data masking to protect their data, avoid security breaches, and ensure compliance.

But what exactly does data masking mean, and how does it work? Read on to find out.

What Is Data Masking Anyway?

Data masking, also known as data obfuscation, pseudonymization, or data anonymization, is the process of creating a fake but realistic version of your organizational data. It’s designed to protect sensitive data by providing a functional alternative when real data isn’t required, for instance, during a sales demo, software testing, or user training.

To give you more perspective, try imagining a scenario where your team is working with a contractor to develop a database. Masking your data will allow the contractor to test the database environment without needing access to actual sensitive customer information.

There are several ways to alter this data, like character shuffling, data encryption, and word or character substitution. Each of these processes changes the values of the data, but they continue using the same format. They try to create a version of your data that cannot be deciphered or reverse engineered, which is why you’re assured of greater data security.

How Data Masking Works

Data masking works by shielding confidential data like credit card information, names, addresses, phone numbers, and Social Security numbers from unintended exposure to unauthorized people. This way, it successfully minimizes data breaches by masking test and development environments created from production data regardless of database, location, or platform.

Understanding the Different Data Masking Techniques

While there are many data masking techniques, you should select one based on the nature of your data and the scope of testing. Let’s take a look at the different techniques in more detail below:

  • Anagramming or Shuffling. The order of the characters or digits is shuffled for every entry. So, “Paul” and “8299” will become “Upla” and “9298,” respectively.
  • Nulling. Data values are replaced with placeholder characters or returned as blank.
  • Encryption. Companies encrypt their sensitive data before exporting it. This data can only be decrypted by someone who has the key or password to do so.
  • Substitution. Every value is replaced with a random selection of appropriate substitute values. For example, you can compile a list of non-functional credit card numbers and then swap them in for the real credit card numbers during the masking process.
  • Stochastic Substitution. This method looks at the variance between values in the field and creates a random value within that range. For instance, if you have dates that fall within six months as values, the masking algorithm will create a set of appropriately distributed random dates within the same six-month period.
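
To make these techniques more concrete, here is a minimal Python sketch of shuffling, nulling, substitution, and stochastic substitution. The function names, the placeholder character, and the list of fake card numbers are assumptions made for illustration, not part of any particular masking product. Encryption is left out because it would need a third-party library.

```python
import random
from datetime import date, timedelta

# Hypothetical list of non-functional, test-style card numbers.
FAKE_CARDS = ["4111 1111 1111 1111", "5555 5555 5555 4444", "4012 8888 8888 1881"]


def shuffle_chars(value: str, rng: random.Random) -> str:
    """Anagramming or shuffling: reorder the characters of one entry."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


def null_out(value: str, placeholder: str = "X") -> str:
    """Nulling: replace every character with a placeholder character."""
    return placeholder * len(value)


def substitute(value: str, substitutes: list, rng: random.Random) -> str:
    """Substitution: ignore the real value and pick a random stand-in."""
    return rng.choice(substitutes)


def stochastic_dates(values: list, rng: random.Random) -> list:
    """Stochastic substitution: random dates inside the observed range."""
    low, high = min(values), max(values)
    span_days = (high - low).days
    return [low + timedelta(days=rng.randint(0, span_days)) for _ in values]


if __name__ == "__main__":
    rng = random.Random()
    print(shuffle_chars("Paul", rng))
    print(null_out("8299"))
    print(substitute("real card number goes here", FAKE_CARDS, rng))
    print(stochastic_dates([date(2023, 1, 1), date(2023, 6, 30)], rng))
```

Note that a random shuffle can occasionally return the original order, especially for short values, so a real implementation would likely check for that and reshuffle.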

Moreover, each of these methods can be applied statically or dynamically, which is where static data masking and dynamic data masking come into play.

Static Data Masking
In this approach, the masking rules are applied at the source, so the stored copy itself is masked. That makes it unlikely for your sensitive data to be exposed, but it also means you cannot use this masked data for any purpose that requires unmasked data.

If you plan on cloning production data to non-production systems for software development or software testing, it’s best to employ static data masking.

You can copy production data that keeps its “realism” without exposing the actual sensitive data. Production users can still see the sensitive data, but it stays protected from developers, testers, and admins.
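
As a rough illustration of static masking during a clone, the sketch below copies a table from one SQLite database to another and masks the sensitive columns on the way. The `customers` table, its columns, the sample rows, and the masking rules are all hypothetical. A real project would apply whatever rules its own scope calls for.

```python
import sqlite3


def build_production() -> sqlite3.Connection:
    """Create a stand-in 'production' database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, ssn TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, name, ssn) VALUES (?, ?, ?)",
        [(1, "Paul Example", "123-45-6789"), (2, "Ana Example", "987-65-4321")],
    )
    conn.commit()
    return conn


def mask_ssn(ssn: str) -> str:
    """Keep the format and the last four digits, hide the rest."""
    return "XXX-XX-" + ssn[-4:]


def clone_with_masking(source: sqlite3.Connection, target: sqlite3.Connection) -> None:
    """Copy customers into the target, storing only masked values there."""
    target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, ssn TEXT)")
    rows = source.execute("SELECT id, name, ssn FROM customers ORDER BY id").fetchall()
    for row_id, _real_name, ssn in rows:
        target.execute(
            "INSERT INTO customers (id, name, ssn) VALUES (?, ?, ?)",
            (row_id, f"Customer {row_id}", mask_ssn(ssn)),
        )
    target.commit()


if __name__ == "__main__":
    production = build_production()
    test_db = sqlite3.connect(":memory:")
    clone_with_masking(production, test_db)
    for row in test_db.execute("SELECT id, name, ssn FROM customers ORDER BY id"):
        print(row)
```

The test database never holds the real values, which is the point of the static approach: nothing in the non-production copy can be unmasked later.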

Dynamic Data Masking
This is a more flexible approach and better suited to a continuous testing environment.

Here, masking is only applied to outgoing exports according to predefined data rules based on factors such as user access level, API call arguments, or any other factor that may require additional data security. Plus, you can apply different types of masking rules to ensure every scenario returns the most appropriate set of data.

In a production system, a good example of masking in-flight data is dealing with the various levels of access or privilege by different users. You can obscure or obfuscate sensitive data that you don’t want others to see through dynamic or in-flight data masking.
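
One way to picture dynamic masking is a function that decides, at read time, which fields a given role may see unmasked. The sketch below assumes a hypothetical `Customer` record and a made-up set of roles. The stored data is never changed, only the view returned to the caller.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Customer:
    name: str
    phone: str
    ssn: str


# Hypothetical access rules: the fields each role may see unmasked.
UNMASKED_FIELDS = {
    "support_admin": {"name", "phone", "ssn"},
    "support_agent": {"name", "phone"},
    "developer": set(),
}


def mask(value: str) -> str:
    """Hide everything except the last two characters."""
    return "*" * max(len(value) - 2, 0) + value[-2:]


def view_for(customer: Customer, role: str) -> dict:
    """Return the record as this role is allowed to see it."""
    allowed = UNMASKED_FIELDS.get(role, set())  # unknown roles see nothing unmasked
    return {
        field: (value if field in allowed else mask(value))
        for field, value in asdict(customer).items()
    }


if __name__ == "__main__":
    record = Customer(name="Paul Example", phone="555-0100", ssn="123-45-6789")
    for role in ("support_admin", "support_agent", "developer"):
        print(role, view_for(record, role))
```

Defaulting unknown roles to a fully masked view is a design choice in this sketch, so that a missing rule hides data rather than exposing it.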

Example 1: Samsung

Samsung produces and analyzes smart devices like mobile phones and TV products around the world. But before performing product analysis, the electronics giant has to protect personal information as required by local regulations.

To ensure legal compliance with personal privacy, Samsung used Dataguise’s tool for protecting sensitive data assets in Hadoop. This tool automatically discovers consumer privacy data and encrypts it before migrating the data to an AWS analytics tool. As such, only authorized users can access and perform analytics on real data.

Example 2: Independence Health Group

Independence Health Group wanted a team of on- and off-shore developers to test applications using real data. But before anything could happen, the company had to mask protected health information (PHI) and other personally identifiable information.

The health insurance company decided to use Informatica Dynamic Data Masking to disguise sensitive information, such as member names, birthdates, Social Security numbers, and other data in real time as developers pull down data sets.

How to Get Started With Data Masking

Let’s take a look at how you can carry out data masking successfully.

Step 1: Understand the Project Scope

You must know what information needs to be protected, who has the authority to see it, which applications use the data, and where the data will be stored (both in production and non-production domains).

Without knowing this, you won’t be able to perform data masking effectively.

And while this may seem easy, you must be prepared to put in lots of effort and have an action plan in place as a separate stage of the project, especially if you’re dealing with complex operations and multiple lines of business.

Step 2: Maintain Referential Integrity

Referential integrity means that every type of information coming from a business application must be masked using the same algorithm.

More often than not, it isn’t feasible for large organizations to use a single data masking tool. Each line of business may need its own data masking implementation because of differing budget and business requirements, IT administration practices, and security or regulatory requirements.

Also, you have to ensure that the different data masking tools and practices are synchronized across your organization when dealing with the same type of data. Trust us, this can work wonders to minimize challenges when you use the data across business lines.
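
A common way to keep masked data consistent is deterministic masking, where the same input always produces the same masked value. The sketch below uses a keyed hash (HMAC-SHA256) for that, which is one possible choice rather than a prescribed method. The tables, the column name, and the `CUST` prefix are hypothetical.

```python
import hashlib
import hmac

# Made-up sample tables that share a customer identifier.
CUSTOMERS = [{"customer_id": "C-1001"}, {"customer_id": "C-1002"}]
ORDERS = [
    {"order_id": 1, "customer_id": "C-1001"},
    {"order_id": 2, "customer_id": "C-1001"},
    {"order_id": 3, "customer_id": "C-1002"},
]


def pseudonymize(value: str, key: bytes, prefix: str = "CUST") -> str:
    """Map a value to a stable pseudonym: same value and key, same output."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:12]}"


def mask_column(rows: list, column: str, key: bytes) -> list:
    """Return copies of the rows with one column pseudonymized."""
    return [{**row, column: pseudonymize(row[column], key)} for row in rows]


if __name__ == "__main__":
    key = b"demo-key-do-not-use-in-production"  # assumption: a real key comes from a secret store
    masked_customers = mask_column(CUSTOMERS, "customer_id", key)
    masked_orders = mask_column(ORDERS, "customer_id", key)
    known_ids = {row["customer_id"] for row in masked_customers}
    # Every masked order still points at a masked customer.
    print(all(row["customer_id"] in known_ids for row in masked_orders))
```

Because both tables are masked with the same function and key, joins between them keep working after masking. Two business lines that use different keys or algorithms for the same identifier would lose that link.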

Step 3: Protect Your Data Masking Algorithms

You should consider how to secure your data masking algorithms, as well as the alternative data sets or dictionaries used to scramble the data. Because these algorithms are extremely sensitive, only authorized users should have access to them.

If any bad actor learns which repeatable masking algorithms are being used, they can reverse engineer large blocks of sensitive information, putting your data at risk.
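
To see why this matters, consider a hypothetical four-digit value masked with a plain, unkeyed hash. Anyone who knows the algorithm can try all 10,000 candidates and recover the original, as the sketch below shows. The same attack fails against a keyed variant as long as the key stays secret. The function names and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
import hmac
from typing import Optional


def unkeyed_mask(pin: str) -> str:
    """Repeatable masking with no secret: anyone can recompute it."""
    return hashlib.sha256(pin.encode("utf-8")).hexdigest()


def keyed_mask(pin: str, key: bytes) -> str:
    """Repeatable masking that also depends on a secret key."""
    return hmac.new(key, pin.encode("utf-8"), hashlib.sha256).hexdigest()


def brute_force(masked: str) -> Optional[str]:
    """Attacker's view: knows the unkeyed algorithm, tries every 4-digit value."""
    for candidate in range(10_000):
        pin = f"{candidate:04d}"
        if unkeyed_mask(pin) == masked:
            return pin
    return None


if __name__ == "__main__":
    print(brute_force(unkeyed_mask("8299")))                  # the original is recovered
    print(brute_force(keyed_mask("8299", b"secret-key")))     # no match without the key
```

This is also why the key, settings, and substitution lists deserve the same protection as the data itself.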

We recommend ensuring a separation of duties to mitigate risks. For instance, you can have IT security personnel decide which methods and algorithms will be used in general, but give data owners exclusive access to specific algorithm settings and data lists in the relevant department. This will again protect your data masking algorithms, keeping your sensitive data secure.
