Defining PII/SII and how to anonymize your data for use in Generative AI like ChatGPT
With so much data out there, it's easy to lose track of what actually needs to be protected. To start, we will focus on two key data classifications: PII and SII.
PII stands for Personally Identifiable Information and SII stands for Sensitive Identifiable Information. PII is information that can be used to identify an individual, while SII is information that requires special handling and protection due to its sensitive nature.
Examples of PII include:
Full name
Social Security number
Date of birth
Home address
Email address
Phone number
Driver's license number
Passport number
Credit card number
Bank account number
Examples of SII include:
Medical records
Financial information
Criminal records
Biometric data
Employment records
Education records
Genetic data
Information related to legal proceedings
Personal communications
National security information
Here are several techniques that can be used to remove or protect PII and SII in your data, many of them programmatically:
Data masking: Replacing the sensitive information with asterisks or other characters to hide it from view.
Data encryption: Using encryption algorithms to protect sensitive information while it is stored or transmitted.
Data deletion: Deleting the sensitive information from the data set altogether.
Data tokenization: Replacing sensitive information with randomly generated tokens that can be used as substitutes without revealing the actual data.
Anonymization: Removing or obfuscating all identifying information so that it cannot be linked to a specific individual.
Data redaction: Removing specific information from a document or record while leaving the rest intact.
Differential privacy: Adding calibrated random noise to data or query results so that it is difficult to tell whether any one individual is present in the data set.
Data minimization: Collecting and retaining only the minimum amount of data necessary for a given purpose.
Access controls: Limiting access to sensitive data only to authorized personnel.
Data retention policies: Setting limits on how long sensitive data can be stored before it is permanently deleted.
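Two of the techniques listed above, redaction and tokenization, can be sketched in JavaScript as follows. This is a minimal illustration rather than a complete PII detector: the regular expressions, the placeholder labels, the TokenVault class, and the sample values are all assumptions made for this example, and the patterns only cover a few common US-style formats.

```javascript
// Sketch only: patterns, labels, and the token format are illustrative
// assumptions. Real PII detection needs far broader coverage.
const crypto = require("crypto");

// Redaction: remove specific values while leaving the rest of the text intact.
const PATTERNS = [
  { label: "[EMAIL]", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "[SSN]", regex: /\b\d{3}-\d{2}-\d{4}\b/g },
  { label: "[PHONE]", regex: /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g },
];

function redact(text) {
  // Apply each pattern in turn, replacing every match with its label.
  return PATTERNS.reduce((out, { label, regex }) => out.replace(regex, label), text);
}

// Tokenization: swap each sensitive value for a random token and keep the
// mapping in a separate store so authorized code can reverse it later.
class TokenVault {
  constructor() {
    this.tokenByValue = new Map();
    this.valueByToken = new Map();
  }

  tokenize(value) {
    // Reuse the existing token so the same value always maps to the same token.
    if (this.tokenByValue.has(value)) return this.tokenByValue.get(value);
    const token = `tok_${crypto.randomBytes(8).toString("hex")}`;
    this.tokenByValue.set(value, token);
    this.valueByToken.set(token, value);
    return token;
  }

  detokenize(token) {
    return this.valueByToken.get(token);
  }
}

const note = "Contact jane.doe@example.com or 555-123-4567 about SSN 123-45-6789";
console.log(redact(note));
// Contact [EMAIL] or [PHONE] about SSN [SSN]

const vault = new TokenVault();
const token = vault.tokenize("4111 1111 1111 1111"); // well-known test card number
console.log(token); // random: "tok_" followed by 16 hex characters
console.log(vault.detokenize(token)); // 4111 1111 1111 1111
```

In this sketch the vault lives in memory. In practice the token mapping would need to be stored separately from the tokenized data and protected with the access controls described above, since anyone holding the mapping can reverse the tokens.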
Now that you know the technical terms for these data protection techniques, you can ask ChatGPT to provide examples of them.
Sure, here's a simple example of how you could use JavaScript to mask customer names:
This example defines an array of customer names and a maskName function that takes a name and returns a masked version of it. The maskName function splits the name into parts (first name and last name), masks all but the first letter of the last name, and then rejoins the parts. Finally, the map function is used to apply the maskName function to all customer names in the array, and the masked names are logged to the console.
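The code from ChatGPT's reply is not reproduced in this text. A minimal sketch consistent with the description above might look like the following; the sample names are hypothetical, and treating a single-word name as a last name is an assumption of this sketch rather than something the description specifies.

```javascript
// Hypothetical sample data; real names would come from your own data source.
const customerNames = ["Jane Doe", "John Smith", "Maria Garcia"];

// Keep the first name(s) and the first letter of the last name, and replace
// the rest of the last name with asterisks.
function maskName(name) {
  const parts = name.trim().split(/\s+/);
  const lastName = parts.pop();
  if (!lastName) return ""; // empty or whitespace-only input
  // Assumption: a single-word name is treated as a last name and masked too.
  const maskedLast = lastName[0] + "*".repeat(lastName.length - 1);
  return [...parts, maskedLast].join(" ");
}

// Apply maskName to every customer name and log the result.
const maskedNames = customerNames.map(maskName);
console.log(maskedNames);
// [ 'Jane D**', 'John S****', 'Maria G*****' ]
```

Note that the first name and the length of the last name remain visible here, so this kind of masking hides data from casual view but is weaker than tokenization or full anonymization.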
It returns a script with some context, but obviously your names are not hardcoded the way they are in the sample ChatGPT provided. Let’s send another prompt asking how you would do this with a CSV file on your desktop: “update the sample above with how I would load names and export a new CSV from a file on my desktop”. Now you have a script that loads the file, parses out the names, and exports the result.
Whether to build or buy data protection software is something you or your organization will have to decide. The bottom line is that you need to implement a process to protect your data, and public tools should never be used on data that requires protection.
We talked a lot about specific data elements, but the same protections should be applied to more technical prompt content, such as keys, secrets, proprietary patterns, and infrastructure details, which may open you up to the risk of attacks.
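Returning to the CSV prompt earlier in this section: ChatGPT's actual reply is not shown here, but the kind of script such a prompt tends to produce can be sketched as follows for Node.js. The file names customers.csv and customers_masked.csv, the "name" column header, and the simple comma splitting are all assumptions made for illustration. A real CSV with quoted fields or embedded commas needs a proper CSV parser.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Same masking rule as the name example: keep the first name(s) and the first
// letter of the last name, and replace the rest with asterisks.
function maskName(name) {
  const parts = name.trim().split(/\s+/);
  const lastName = parts.pop();
  if (!lastName) return "";
  return [...parts, lastName[0] + "*".repeat(lastName.length - 1)].join(" ");
}

// Mask one column of CSV text. Assumptions: the first row is a header, and no
// field contains commas, quotes, or line breaks.
function maskCsv(csvText, column = "name") {
  const [header, ...rows] = csvText.trim().split(/\r?\n/);
  const columns = header.split(",").map((c) => c.trim());
  const index = columns.indexOf(column);
  if (index === -1) throw new Error(`Column "${column}" not found`);
  const maskedRows = rows.map((row) => {
    const fields = row.split(",");
    fields[index] = maskName(fields[index] ?? ""); // tolerate short rows
    return fields.join(",");
  });
  return [header, ...maskedRows].join("\n") + "\n";
}

// Hypothetical file names on the current user's desktop.
const inputPath = path.join(os.homedir(), "Desktop", "customers.csv");
const outputPath = path.join(os.homedir(), "Desktop", "customers_masked.csv");

if (fs.existsSync(inputPath)) {
  // Read the original file, mask the name column, and write a new file
  // so the source data is never overwritten.
  fs.writeFileSync(outputPath, maskCsv(fs.readFileSync(inputPath, "utf8")));
  console.log(`Wrote ${outputPath}`);
} else {
  console.log(`No file found at ${inputPath}`);
}
```

Because the script runs locally, the unmasked names never leave your machine. Only the masked output file would be a candidate for use with a public tool, and only if your policies consider that level of masking sufficient.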
If you work within an organization, it is important to read your security and compliance policies and spend time aligning those requirements with their corresponding protection processes. Work with your IT security and compliance teams to get the right tooling and processes in place to protect yourself while realizing the productivity gains of AI-driven development.