From the course: CompTIA Security (SY0-701) Cert Prep: 1 General Security Concepts

Data de-identification

One way that many organizations seek to protect themselves against accidental disclosures of personal information is to remove all identifying information from datasets when that information is not necessary to meet business requirements. Deidentification is the process of moving through a dataset and removing data that may be individually identifying. For example, you would certainly want to remove names, Social Security numbers, and other obvious identifiers.

However, simple data deidentification is often insufficient to completely safeguard information. The reason is that you can often combine seemingly innocuous fields to uniquely identify an individual. A study done at Carnegie Mellon University analyzed three fields commonly retained in deidentified datasets: zip code, date of birth, and gender. You wouldn't think that any one of these fields, when used alone, would allow you to identify someone. After all, a lot of people live in the same town as me, and there are a lot of people on the planet who were born on the same day I was born. However, the danger comes when you combine them all. That Carnegie Mellon study found that these three elements together uniquely identify 87% of people in the United States. So while there may indeed be many people in my zip code and many people in the world born on the same day as me, there's an 87% chance that I am the only male in my zip code born on my birthday.

What this means for us is that we need to be much more careful with protecting data than simply removing obvious identifiers. Instead of just deidentifying data, we need to anonymize our data, making it almost impossible for someone to figure out the identity of an individual person. The HIPAA standards include a rigorous process for anonymizing data that's widely accepted in the analytics community. It offers two pathways to clearing a dataset.
First, you can have statisticians analyze your dataset and validate that it would be very unlikely that it could disclose the identity of an individual. This pathway, known as expert determination, requires access to professional statisticians, and it still leaves some possibility of an accidental disclosure.

Alternatively, you can opt to use the safe harbor approach, which requires eliminating 18 data elements from your dataset that might be combined with each other to reveal an individual's identity. I won't read you this whole list, but you're welcome to peruse it on the US Department of Health and Human Services website. It includes things like Social Security numbers and email addresses, as well as date of birth and zip code.

Whatever method you choose for data deidentification and anonymization, make sure that you've thought through this issue carefully and that you're taking appropriate steps to protect the privacy of your data subjects.
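
To make the combination risk concrete, here is a minimal Python sketch. It is an illustration only, not a HIPAA compliance tool. The records, the field names, and the helper functions (`drop_direct_identifiers`, `unique_fraction`, `generalize`) are all hypothetical and invented for this example. The sketch removes names, measures how many records are still unique on zip code, date of birth, and gender, and then coarsens those fields (year only, first three zip digits) in a way loosely modeled on the safe harbor idea, without implementing its full rules.

```python
from collections import Counter

# Hypothetical toy records; every value here is invented for illustration.
records = [
    {"name": "A. Jones", "zip": "15213", "dob": "1980-03-14", "gender": "M"},
    {"name": "B. Smith", "zip": "15213", "dob": "1980-07-02", "gender": "M"},
    {"name": "C. Lee", "zip": "15217", "dob": "1980-03-14", "gender": "F"},
    {"name": "D. Patel", "zip": "15217", "dob": "1975-11-30", "gender": "F"},
    {"name": "E. Garcia", "zip": "15213", "dob": "1980-12-25", "gender": "M"},
    {"name": "F. Chen", "zip": "15217", "dob": "1975-01-09", "gender": "F"},
]

# Fields that look harmless alone but can identify someone in combination.
QUASI_IDENTIFIERS = ("zip", "dob", "gender")


def drop_direct_identifiers(rows, fields=("name",)):
    """Simple deidentification: remove the obvious identifying fields."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


def unique_fraction(rows, keys=QUASI_IDENTIFIERS):
    """Fraction of rows whose combination of `keys` appears exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if counts[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


def generalize(rows):
    """Coarsen quasi-identifiers: keep only birth year and a 3-digit zip prefix.

    Assumption: this is a simplification for illustration. The actual safe
    harbor rules cover 18 elements and add conditions not modeled here.
    """
    return [{**row, "zip": row["zip"][:3], "dob": row["dob"][:4]} for row in rows]


deidentified = drop_direct_identifiers(records)
coarsened = generalize(deidentified)

print("unique after removing names:", unique_fraction(deidentified))
print("unique after coarsening:", unique_fraction(coarsened))
```

In this toy dataset, every record is still unique after the names are removed, while coarsening the date of birth and zip code leaves only one of the six records unique. Real datasets behave differently, so a check like this is a starting point for a conversation with your statisticians, not a substitute for it.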
