One of the key requirements while masking data in large databases or multi database environment is to consistently mask some columns, i.e. for a given input the output should always be the same. At the same time the masked output should not be predictable.
Deterministic masking also eliminates the need to spend enormous amount of time spent in identifying data relationships, i.e. parent and child relationships among columns defined in the application tables.
In this blog post I will explain different ways of consistently masking the data across databases using Oracle Data Masking and Subsetting
The readers of post should have minimal knowledge on Oracle Enterprise Manager 12c, Application Data Modeling, Data Masking concepts. For more information on these concepts, please refer to Oracle Data Masking and Subsetting document
Oracle Data Masking and Subsetting 12c provides four methods using which users can consistently yet irreversibly mask their inputs.
1. Substitute
2. SQL Expression
3. Encrypt
4. User Defined Function
SUBSTITUTE
The substitute masking format replaces the original value with a value from a pre-created database table. As the method uses a hash based algorithm in the back end the mappings are consistent.
For example consider DEPARTMENT_ID in EMPLOYEES table is replaced with FAKE_DEPARTMENT_ID from FAKE_TABLE. The substitute masking transformation that all occurrences of DEPARTMENT_ID say ‘101’ will be replaced with ‘502’ provided same substitution table and column is used , i.e. FAKE_TABLE.FAKE_DEPARTMENT_ID.
The following screen shot shows the usage of the Substitute masking format with in a masking definition:
Note that the uniqueness of the masked value depends on the number of columns being used in the substitution table i.e. if the original table contains 50000 unique values, then for the masked output to be unique and deterministic the substitution column should also contain 50000 unique values without which only consistency is maintained but not uniqueness.
SQL EXPRESSION
SQL Expression replaces an existing value with the output of a specified SQL Expression.
For example while masking an EMPLOYEES table the EMAIL_ID of an
employee has to be in the format EMPLOYEE’s
[email protected] while FIRST_NAME and LAST_NAME are the actual column names of the EMPLOYEES table then the corresponding SQL Expression will look like %FIRST_NAME%||’.’||%LAST_NAME%||’@COMPANY.COM’.
The advantage of this technique is that if you are masking FIRST_NAME and LAST_NAME of the EMPLOYEES table than the corresponding EMAIL ID will be replaced accordingly by the masking scripts.
One of the interesting aspect’s of a SQL Expressions is that you can use sub SQL expressions, which means that you can write a nested SQL and use it as SQL Expression to address a complex masking business use cases.
SQL Expression can also be used to consistently replace value with hashed value using Oracle’s PL/SQL function ORA_HASH.
The following SQL Expression will help in the previous example for replacing the DEPARTMENT_IDs with a hashed number
ORA_HASH (%DEPARTMENT_ID%, 1000)
The following screen shot shows the usage of encrypt masking format with in the masking definition:
ORA_HASH takes three arguments:
1. Expression which can be of any data type except LONG, LOB, User Defined Type [nested table type is allowed]. In the above example I used the Original value as expression.
2. Number of hash buckets which can be number between 0 and 4294967295. The default value is 4294967295.
You can also co-relate the number of hash buckets to a range of numbers. In the above example above the bucket value is specified as 1000, so the end result will be a hashed number in between 0 and 1000.
3. Seed, can be any number which decides the consistency, i.e. for a given seed value the output will always be same. The default seed is 0. In the above SQL Expression a seed in not specified, so it to 0. If you have to use a non default seed then the function will look like.
ORA_HASH (%DEPARTMENT_ID%, 1000, 1234
The uniqueness depends on the input and the number of hash buckets used. However as ORA_HASH uses a 32 bit algorithm, considering birthday paradox or pigeonhole principle there is a 0.5 probability of collision after 232-1 unique values.
ENCRYPT
Encrypt masking format uses a blend of 3DES encryption algorithm, hashing, and regular expression to produce a deterministic and unique masked output. The format of the masked output corresponds to the specified regular expression. As this technique uses a key [string] to encrypt the data, the same string can be used to decrypt the data. The key also acts as seed to maintain consistent outputs for a given input.
The following screen shot shows the usage of encrypt masking format with in the masking definition:
Regular Expressions may look complex for the first time users but you will soon realize that it’s a simple language. There are many resources in internet, oracle documentation, oracle learning library, my oracle support on writing a Regular Expressions, out of all the following My Oracle Support document helped me to get started with Regular Expressions:
Oracle SQL Support for Regular Expressions[Video](Doc ID 1369668.1)
USER DEFINED FUNCTION [UDF]
User Defined Function or UDF provides flexibility for the users to code their own masking logic in PL/SQL, which can be called from masking Defintion.
The standard format of an UDF in Oracle Data Masking and Subsetting is:
Function udf_func (rowid varchar2, column_name varchar2, original_value varchar2) returns varchar2;
Where
• rowid is the row identifier of the column that needs to be masked
• column_name is the name of the column that needs to be masked
• original_value is the column value that needs to be masked
You can achieve deterministic masking by using Oracle’s built in hash functions like, ORA_HASH, DBMS_CRYPTO.MD4, DBMS_CRYPTO.MD5, DBMS_UTILITY. GET_HASH_VALUE.Please refers to the Oracle Database Documentation for more information on the Oracle Hash functions.
For example the following masking UDF generate deterministic unique hexadecimal values for a given string input:
CREATE OR REPLACE FUNCTION RD_DUX (rid varchar2, column_name varchar2, orig_val VARCHAR2) RETURN VARCHAR2
DETERMINISTIC PARALLEL_ENABLE IS
stext varchar2 (26);
no_of_characters number(2);
BEGIN
no_of_characters:=6;
stext:=substr(RAWTOHEX(DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW(text),1)),0,no_of_characters);
RETURN stext;
END;
The uniqueness depends on the input and length of the string and number of bits used by hash algorithm. In the above function MD4 hash is used [denoted by argument 1 in the DBMS_CRYPTO.HASH function which is a 128 bit algorithm which produces 2^128-1 unique hashed values , however this is limited by the length of the input string which is 6, so only 6^6 unique values will be generated. Also do not forget about the birthday paradox/pigeonhole principle mentioned earlier in this post.
An another example is to consistently replace characters or numbers preserving the length and special characters as shown below:
CREATE OR REPLACE FUNCTION RD_DUS(rid varchar2,column_name varchar2,orig_val VARCHAR2) RETURN VARCHAR2
DETERMINISTIC PARALLEL_ENABLE IS
stext varchar2(26);
BEGIN
DBMS_RANDOM.SEED(orig_val);
stext:=TRANSLATE(orig_val,'ABCDEFGHILKLMNOPQRSTUVWXYZ',DBMS_RANDOM.STRING('U',26));
stext:=TRANSLATE(stext,'abcdefghijklmnopqrstuvwxyz',DBMS_RANDOM.STRING('L',26));
stext:=TRANSLATE(stext,'0123456789',to_char(DBMS_RANDOM.VALUE(1,9)));
stext:=REPLACE(stext,'.','0');
RETURN stext;
END;
The following screen shot shows the usage of an UDF with in a masking definition:
To summarize, Oracle Data Masking and Subsetting helps you to consistently mask data across databases using one or all of the methods described in this post. It saves the hassle of identifying the parent-child relationships defined in the application table.
Happy Masking