Tuesday, June 11, 2013

Ruminating on Data Masking

Many organizations are interested in data masking and are actively evaluating solutions for it. The IBM and Informatica data masking tools are positioned as leaders in Gartner's Magic Quadrant.

The need for masking data is simple: how do we share sensitive enterprise data with the development teams, testing teams, training teams and even the offshore teams?
Besides masking data, there are other potential solutions to this problem, e.g. test data creation tools and UI playback tools. But data masking and subsetting remain popular means of preparing production data for non-production use.

Some of the key requirements for any Data Masking Solution are:
  1. Meaningful Masked Data: The masked data has to be meaningful and realistic, and it should still satisfy the business rules that applied to the original data. E.g. post codes, credit card numbers, SSNs and bank account numbers should keep a valid format. Derived fields must also stay consistent: if we change DOB, we should also change 'Age'.
  2. Referential Integrity: If we are scrambling primary keys then we need to ensure that the relationships are maintained. One technique is to make sure that the same scramble functions are applied to all of the related columns. Sometimes, if we are masking data across databases, then we would need to ensure integrity across databases.
  3. Irreversible Masking: The masking should be irreversible, i.e. it should be impossible to recreate the sensitive data from the masked data.
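
The three requirements above can be illustrated with a small Python sketch. It is only one possible approach, not how any particular vendor tool works. It assumes a keyed hash (HMAC-SHA256) as the scramble function, and the key and the function names (`SECRET_KEY`, `mask_id`, `mask_card`) are made up for this example. The masked card number passes the Luhn check, so it stays "meaningful" for validation rules, although it will not carry a real issuer prefix.

```python
import hashlib
import hmac

# Hypothetical masking key; in practice it would live outside the code base.
SECRET_KEY = b"demo-key-not-for-real-use"


def _digest(value: str) -> bytes:
    """Keyed one-way hash of the input (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()


def mask_id(value: str) -> str:
    """Deterministic surrogate for a key column.

    The same input always gives the same output, so applying this function
    to a primary key and to every foreign key that points at it keeps the
    relationships intact.
    """
    return _digest(value).hex()[:12]


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:  # these positions are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_card(card_number: str) -> str:
    """Replace a card number with a 16-digit, Luhn-valid fake."""
    body = str(int.from_bytes(_digest(card_number), "big") % 10**15).zfill(15)
    return body + luhn_check_digit(body)


# The same customer id masks to the same value in both tables.
customer = {"id": "C1001", "card": "4111111111111111"}
order = {"order_id": "O-1", "customer_id": "C1001"}
masked_customer = {"id": mask_id(customer["id"]), "card": mask_card(customer["card"])}
masked_order = {"order_id": order["order_id"], "customer_id": mask_id(order["customer_id"])}
```

A keyed hash cannot be inverted directly, but if the key leaks, values from a small domain (such as sequential ids) could be recovered by brute force, so the key has to be protected like any other secret.
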
A good architecture strategy for building a data masking solution is to design a policy-driven data masking rule engine. Business users can then define policies for masking different data sets.
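
As a rough sketch of what "policy-driven" could mean, the Python snippet below keeps the policy as plain data (table, column, rule name) and keeps the masking functions in a separate registry. Everything in it is hypothetical: the rule names, the `POLICY` layout and the `apply_policy` function are invented for illustration, and the unsalted hash is only a placeholder for a stronger scramble function.

```python
import hashlib
from typing import Callable, Dict

MaskFn = Callable[[str], str]


def redact(value: str) -> str:
    """Replace every character with '*'."""
    return "*" * len(value)


def keep_last4(value: str) -> str:
    """Hide everything except the last four characters."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def hash_token(value: str) -> str:
    """Deterministic token; a placeholder, not a vetted scramble function."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:10]


# Registry of rules the engine knows about (names are made up).
RULES: Dict[str, MaskFn] = {
    "redact": redact,
    "keep_last4": keep_last4,
    "hash": hash_token,
}

# The policy is data, not code: table -> column -> rule name.
# It could just as well be loaded from a file that business users edit.
POLICY: Dict[str, Dict[str, str]] = {
    "customers": {"name": "redact", "ssn": "keep_last4", "email": "hash"},
}


def apply_policy(table: str, row: Dict[str, str],
                 policy: Dict[str, Dict[str, str]] = POLICY) -> Dict[str, str]:
    """Mask one row according to the policy; unlisted columns pass through."""
    column_rules = policy.get(table, {})
    masked = {}
    for column, value in row.items():
        rule_name = column_rules.get(column)
        masked[column] = RULES[rule_name](value) if rule_name else value
    return masked


row = {"id": "42", "name": "Alice", "ssn": "123-45-6789", "email": "alice@example.com"}
masked_row = apply_policy("customers", row)
```

The point of the split is that adding a new data set means editing the policy only, while the rule functions change only when a new masking technique is needed.
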

Many data masking tool vendors are now venturing beyond static data masking. Dynamic data masking is a newer concept that masks data in real time, as it is requested, instead of producing a masked copy up front. There is also a growing demand for masking data in unstructured content such as PDF, Word or Excel files.