Protecting Data

Defining PII/SII and how to anonymize your data for use in generative AI tools like ChatGPT

With so much data out there, it's easy to lose track of what actually needs to be protected. To start, we will focus on two key data classifications:

PII & SII

PII stands for Personally Identifiable Information and SII stands for Sensitive Identifiable Information. PII is information that can be used to identify an individual, while SII is information that requires special handling and protection due to its sensitive nature.

Examples of PII:

  1. Full name

  2. Social Security number

  3. Date of birth

  4. Home address

  5. Email address

  6. Phone number

  7. Driver's license number

  8. Passport number

  9. Credit card number

  10. Bank account number

Examples of SII:

  1. Medical records

  2. Financial information

  3. Criminal records

  4. Biometric data

  5. Employment records

  6. Education records

  7. Genetic data

  8. Information related to legal proceedings

  9. Personal communications

  10. National security information

Techniques to remove PII and SII from data

Here are several techniques that can be used to programmatically remove, obscure, or otherwise protect PII and SII in your data:

  1. Data masking: Replacing the sensitive information with asterisks or other characters to hide it from view.

  2. Data encryption: Using encryption algorithms to protect sensitive information while it is stored or transmitted.

  3. Data deletion: Deleting the sensitive information from the data set altogether.

  4. Data tokenization: Replacing sensitive information with randomly generated tokens that can be used as substitutes without revealing the actual data.

  5. Anonymization: Removing or obfuscating all identifying information so that it cannot be linked to a specific individual.

  6. Data redaction: Removing specific pieces of information from a document or record while leaving the rest intact (see the sketch after this list).

  7. Differential privacy: Adding random noise to data to make it difficult to identify individuals.

  8. Data minimization: Collecting and retaining only the minimum amount of data necessary for a given purpose.

  9. Access controls: Limiting access to sensitive data only to authorized personnel.

  10. Data retention policies: Setting limits on how long sensitive data can be stored before it is permanently deleted.
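As a concrete illustration of the redaction technique above, here is a minimal JavaScript sketch that replaces common PII patterns (email addresses, US Social Security numbers, and phone numbers) with placeholder tags. The regular expressions and the sample record are simplified assumptions for demonstration and will not catch every real-world format.

// Minimal PII redaction sketch (illustrative patterns only).
// These regular expressions are simplified assumptions and will not
// match every real-world email, SSN, or phone number format.
const redactionRules = [
  { label: "[EMAIL]", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "[SSN]", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { label: "[PHONE]", pattern: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g }
];

// Apply each rule in turn, replacing matches with the placeholder label
function redact(text) {
  return redactionRules.reduce(
    (result, rule) => result.replace(rule.pattern, rule.label),
    text
  );
}

const record = "Contact Jane Doe at jane.doe@example.com or 555-867-5309. SSN: 123-45-6789.";
console.log(redact(record));
// "Contact Jane Doe at [EMAIL] or [PHONE]. SSN: [SSN]."

Note that free-text values like names are much harder to catch with pattern matching alone; the masking example in the next section handles names by structure rather than by regular expression.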

Now that you know the technical terms for these data protection techniques, you can ask ChatGPT to generate examples of them.

Prompt

chatgpt-3.5
provide me a javascript sample for data masking customer names

Response

Sure, here's a simple example of how you could use JavaScript to mask customer names:

// Example customer names
const customerNames = [
  "John Smith",
  "Jane Doe",
  "Bob Johnson",
  "Sarah Lee"
];

// Function to mask a given name
function maskName(name) {
  const parts = name.split(" ");
  const maskedLast = parts[parts.length - 1][0] + "*".repeat(parts[parts.length - 1].length - 1);
  return parts.slice(0, -1).join(" ") + " " + maskedLast;
}

// Mask all customer names
const maskedNames = customerNames.map(name => maskName(name));

console.log(maskedNames);
// Output: ["John S****", "Jane D**", "Bob J******", "Sarah L**"]

This example defines an array of customer names and a maskName function that takes a name and returns a masked version of it. The maskName function splits the name into parts (first name and last name), masks all but the first letter of the last name, and then rejoins the parts.

Finally, the map function is used to apply the maskName function to all customer names in the array, and the masked names are logged to the console.

Discussion

ChatGPT returns a script with some context, but of course your actual names are not hardcoded into the sample it provided. Let's send a follow-up prompt asking how to do this with a CSV file on your desktop: “update the sample above with how I would load names and export a new CSV from a file on my desktop”. Now you have a script that can read the names from a file, mask them, and write them back out.

Whether to build or buy data protection software is a decision you or your organization will have to make. The bottom line is that you need a process in place to protect your data, and public tools should never be used on data that requires protection. We have talked a lot about specific data elements, but the same protections should be applied to more technical prompts containing keys, secrets, proprietary patterns, and infrastructure details that could expose you to attack.
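If you are curious what that follow-up prompt might produce, here is a minimal sketch using Node.js built-in modules. The file names, desktop paths, and single-column CSV layout (a "name" header followed by one full name per row) are illustrative assumptions on my part, not ChatGPT's actual response, which will vary.

// A minimal sketch of the follow-up script, assuming a one-column CSV
// ("name" header, one full name per row) saved on the desktop.
const fs = require("fs");
const path = require("path");
const os = require("os");

// Hypothetical input/output locations on the desktop
const inputPath = path.join(os.homedir(), "Desktop", "customers.csv");
const outputPath = path.join(os.homedir(), "Desktop", "customers_masked.csv");

// Same masking logic as the ChatGPT sample above
function maskName(name) {
  const parts = name.split(" ");
  const last = parts[parts.length - 1];
  const maskedLast = last[0] + "*".repeat(last.length - 1);
  return parts.slice(0, -1).join(" ") + " " + maskedLast;
}

// Read the CSV, mask every name after the header row, and write a new file
const lines = fs.readFileSync(inputPath, "utf8").trim().split("\n");
const header = lines[0];
const maskedLines = lines.slice(1).map(line => maskName(line.trim()));

fs.writeFileSync(outputPath, [header, ...maskedLines].join("\n"));
console.log(`Wrote ${maskedLines.length} masked names to ${outputPath}`);

For anything beyond a single column of values, a proper CSV parsing library is a safer choice than splitting on newlines.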

If you are working within an organization, it is important that you read your security and compliance policies and spend time aligning those requirements to their corresponding protection processes. Work with your IT security and compliance teams to put the right tooling and processes in place so you can protect yourself while realizing the productivity gains of AI-driven development.

Generative References

chatgpt-3.5
what is pii/sii
how to mask pii/sii
write a code sample masking pii/sii
