De-identification

De-identification is the process used to prevent someone's personal identity from being revealed. For example, data produced during human subject research might be de-identified to preserve the privacy of research participants. Biological data may be de-identified to comply with HIPAA regulations governing patient privacy.^[1]

When applied to metadata or general data about identification, the process is also known as data anonymization. Common strategies include deleting or masking personal identifiers, such as personal name, and suppressing or generalizing quasi-identifiers, such as date of birth. The reverse process of using de-identified data to identify individuals is known as data re-identification. Successful re-identifications^[2]^[3]^[4]^[5] cast doubt on de-identification's effectiveness. A systematic review of fourteen distinct re-identification attacks found "a high re-identification rate […] dominated by small-scale studies on data that was not de-identified according to existing standards".^[6]

De-identification is one of the main approaches to data privacy protection.^[7] It is commonly used in the fields of communications, multimedia, biometrics, big data, cloud computing, data mining, internet, social networks, and audio–video surveillance.^[8]

Examples

In designing surveys

Surveys, such as a census, collect information about a specific group of people. To encourage participation and protect respondents' privacy, researchers try to design the survey so that no participant's individual responses can be matched with any published data.^[9]

Before using information

An online shopping website that wants to understand its users' preferences and shopping habits may retrieve customer data from its database for analysis. That data includes personal identifiers collected directly when customers created their accounts. The website needs to pre-process the data with de-identification techniques before analyzing the records, to avoid violating its customers' privacy.

Anonymization

Anonymization refers to irreversibly severing a data set from the identity of the data contributor in a study to prevent any future re-identification, even by the study organizers under any condition.^[10]^[11] De-identification may also include preserving identifying information which can only be re-linked by a trusted party in certain situations.^[10]^[11]^[12] There is a debate in the technology community on whether data that can be re-linked, even by a trusted party, should ever be considered de-identified.^[13]

Techniques

Common strategies of de-identification are masking personal identifiers and generalizing quasi-identifiers. Pseudonymization is the main technique used to mask personal identifiers from data records, and k-anonymization is usually adopted for generalizing quasi-identifiers.

Pseudonymization

Pseudonymization replaces real names with a temporary ID. It deletes or masks personal identifiers so that individuals are not directly identifiable. Because the same ID stays attached to an individual, the individual's record can still be tracked over time, even as it is updated. However, it cannot prevent identification when a specific combination of attributes in the record indirectly identifies the individual.^[14]
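A minimal sketch of this idea in Python follows. It assumes the temporary ID is derived with a keyed hash (HMAC-SHA256), which is only one possible way to generate pseudonyms; the key, the field names and the `P-` prefix are hypothetical choices made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret held by the data custodian (an assumption of this sketch).
SECRET_KEY = b"replace-with-a-randomly-generated-secret"

def pseudonym(name: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable pseudonym from a name using a keyed hash."""
    normalized = name.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256)
    return "P-" + digest.hexdigest()[:12]

def pseudonymize(records, key: bytes = SECRET_KEY):
    """Return copies of the records with 'name' replaced by a pseudonym ID."""
    result = []
    for record in records:
        masked = {k: v for k, v in record.items() if k != "name"}
        masked["pid"] = pseudonym(record["name"], key)
        result.append(masked)
    return result

visits = [
    {"name": "Alice Example", "visit": "2021-03-02", "zip": "02139"},
    {"name": "Bob Example", "visit": "2021-03-05", "zip": "02139"},
    {"name": "Alice Example", "visit": "2021-06-17", "zip": "02139"},
]
masked_visits = pseudonymize(visits)

# The first and third records share one pseudonym, so they can still be linked.
print(masked_visits[0]["pid"] == masked_visits[2]["pid"])
```

The two records for the same person receive the same ID, so they can be followed over time. The remaining attributes, such as the ZIP code, are left untouched and may still identify someone indirectly.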

k-anonymization

k-anonymization defines attributes that indirectly point to an individual's identity as quasi-identifiers (QIs) and processes the data so that at least k individuals share the same combination of QI values.^[14] QI values are handled according to specific standards. For example, k-anonymization replaces some original values in the records with ranges and keeps other values unchanged. The new combinations of QI values prevent individuals from being identified while avoiding the destruction of the data records.
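The following Python sketch illustrates the idea under simple assumptions: age and ZIP code are treated as the quasi-identifiers, ages are replaced with ten-year ranges, and ZIP codes keep only their first three digits. The field names, the generalization rules and the sample records are hypothetical.

```python
from collections import Counter

def generalize(record):
    """Coarsen the quasi-identifiers; keep the other values unchanged."""
    low = (record["age"] // 10) * 10
    return {
        "age": f"{low}-{low + 9}",         # exact age becomes a range
        "zip": record["zip"][:3] + "**",   # only the first three digits are kept
        "diagnosis": record["diagnosis"],  # not a QI in this sketch
    }

def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same QI values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of QI values occurs at least k times."""
    return smallest_group(records, quasi_identifiers) >= k

patients = [
    {"age": 31, "zip": "02139", "diagnosis": "flu"},
    {"age": 34, "zip": "02141", "diagnosis": "asthma"},
    {"age": 38, "zip": "02138", "diagnosis": "flu"},
    {"age": 52, "zip": "02451", "diagnosis": "diabetes"},
    {"age": 57, "zip": "02453", "diagnosis": "flu"},
]
QI = ("age", "zip")
generalized = [generalize(p) for p in patients]

print(is_k_anonymous(patients, QI, 2))     # raw records: every QI combination is unique
print(is_k_anonymous(generalized, QI, 2))  # generalized records: groups of 3 and 2
```

In the raw records every QI combination is unique, so the data is only 1-anonymous. After generalization the records fall into one group of three and one group of two, which satisfies k = 2 but not k = 3.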

Applications

Research into de-identification is driven mostly by the need to protect health information.^[15] Some libraries have adopted methods used in the healthcare industry to preserve their readers' privacy.^[15]

In big data, de-identification is widely adopted by individuals and organizations.^[8] With the development of social media, e-commerce, and big data, de-identification is sometimes required and often used to protect data privacy when users' personal data are collected by companies or third-party organizations that analyze them for their own purposes.

In smart cities, de-identification may be required to protect the privacy of residents, workers and visitors. Without strict regulation, de-identification may be difficult because sensors can still collect information without consent.^[16]

Data De-identification

PHI (protected health information) can be present in various data formats, and each format needs specific techniques and tools to de-identify it:

Text de-identification uses rule-based and NLP (natural language processing) approaches.
PDF de-identification builds on text de-identification and, in most cases, also requires OCR and specific techniques for hiding PHI in the PDF.^[17]
DICOM de-identification requires cleaning metadata, pixel data, and encapsulated documents.
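As an illustration of the rule-based approach to text, the Python sketch below replaces a few kinds of identifiers with placeholder tags using regular expressions. The patterns and labels are hypothetical and deliberately narrow; they do not cover names or most other identifiers, which is where NLP methods are typically used.

```python
import re

# Illustrative patterns only (assumptions of this sketch); real PHI detection
# needs much broader coverage than three regular expressions.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]

def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Seen on 2023-04-11. Call 555-123-4567 or mail jane.doe@example.org."
print(redact(note))
# prints: Seen on [DATE]. Call [PHONE] or mail [EMAIL].
```

Rules like these are easy to audit but miss anything they were not written for, so in practice they are usually combined with other techniques.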

Limits

Whenever a person participates in genetics research, the donation of a biological specimen often results in the creation of a large amount of personalized data. Such data is uniquely difficult to de-identify.^[18]

Anonymization of genetic data is particularly difficult because of the huge amount of genotypic information in biospecimens,^[18] the ties that specimens often have to medical history,^[19] and the advent of modern bioinformatics tools for data mining.^[19] There have been demonstrations that data for individuals in aggregate collections of genotypic data sets can be tied to the identities of the specimen donors.^[20]

Some researchers have suggested that it is not reasonable to ever promise participants in genetics research that they can retain their anonymity, but instead, such participants should be taught the limits of using coded identifiers in a de-identification process.^[11]

De-identification laws in the United States of America

In May 2014, the United States President's Council of Advisors on Science and Technology found de-identification "somewhat useful as an added safeguard" but not "a useful basis for policy" as "it is not robust against near‐term future re‐identification methods".^[21]

The HIPAA Privacy Rule provides mechanisms for using and disclosing health data responsibly without the need for patient consent. These mechanisms center on two HIPAA de-identification standards: Safe Harbor and the Expert Determination Method. Safe Harbor relies on the removal of specific patient identifiers (e.g. name, phone number, email address), while the Expert Determination Method requires knowledge of and experience with generally accepted statistical and scientific principles and methods to render information not individually identifiable.^[22]

Safe harbor

The safe harbor method uses a list approach to de-identification and has two requirements:

The removal or generalization of 18 elements from the data.
That the Covered Entity or Business Associate does not have actual knowledge that the residual information in the data could be used alone, or in combination with other information, to identify an individual.

Safe Harbor is a highly prescriptive approach to de-identification. Under this method, all dates must be generalized to year and zip codes reduced to three digits. The same approach is used on the data regardless of the context. Even if the information is to be shared with a trusted researcher who wishes to analyze the data for seasonal variations in acute respiratory cases and, thus, requires the month of hospital admission, this information cannot be provided; only the year of admission would be retained.
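The two generalization rules mentioned above can be sketched in Python as follows. This is an illustration only: it covers dates and zip codes plus a few directly identifying fields, not the full list of 18 elements, and the field names are hypothetical.

```python
from datetime import date

# Hypothetical field names for directly identifying elements in this sketch.
DIRECT_IDENTIFIERS = {"name", "phone", "email"}

def safe_harbor_sketch(record: dict) -> dict:
    """Drop direct identifiers, keep only the year of dates, truncate zip codes."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # removed outright
        if isinstance(value, date):
            out[field + "_year"] = value.year  # month and day are discarded
        elif field == "zip":
            out["zip3"] = value[:3]  # first three digits only
        else:
            out[field] = value
    return out

admission = {
    "name": "Alice Example",
    "phone": "555-123-4567",
    "admitted": date(2020, 1, 14),
    "zip": "02139",
    "diagnosis": "acute bronchitis",
}
released = safe_harbor_sketch(admission)
print(released)
```

The released record keeps only the admission year, so a seasonal analysis of the kind described above is no longer possible from this data.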

Expert Determination

Expert Determination takes a risk-based approach to de-identification that applies current standards and best practices from the research to determine the likelihood that a person could be identified from their protected health information. This method requires that a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods render the information not individually identifiable. It requires:

That the risk is very small that the information could be used alone, or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information;
That the methods and results of the analysis that justify such a determination are documented.
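The regulation does not prescribe a particular risk measure. As a rough illustration only, the Python sketch below assumes one simple proxy: each record's risk is taken as one divided by the number of records that share its quasi-identifier values, and the highest such risk is compared with a threshold. The proxy, the threshold value and the field names are assumptions of this sketch, not requirements of the rule.

```python
from collections import Counter

def record_risks(records, quasi_identifiers):
    """Per-record risk proxy: 1 / size of the group sharing the same QI values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return [1 / sizes[key] for key in keys]

def summarize(records, quasi_identifiers, threshold):
    """Summarize the risk proxy and compare its maximum with a chosen threshold."""
    risks = record_risks(records, quasi_identifiers)
    return {
        "max_risk": max(risks),
        "mean_risk": sum(risks) / len(risks),
        "meets_threshold": max(risks) <= threshold,
    }

QI = ("birth_decade", "zip3")
cohort = [
    {"birth_decade": "1980s", "zip3": "021"},
    {"birth_decade": "1980s", "zip3": "021"},
    {"birth_decade": "1980s", "zip3": "021"},
    {"birth_decade": "1980s", "zip3": "021"},
    {"birth_decade": "1990s", "zip3": "024"},  # unique combination
]
THRESHOLD = 0.25  # placeholder value chosen for this example

before = summarize(cohort, QI, THRESHOLD)
# Suppress the record whose QI combination is unique, then measure again.
after = summarize(cohort[:4], QI, THRESHOLD)
print(before["meets_threshold"], after["meets_threshold"])
```

Here the single unique record drives the maximum risk to 1.0; once it is suppressed, the remaining four records share one combination and the maximum drops to 0.25. A real determination would also consider what other information is reasonably available to the anticipated recipient.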

Research on decedents

The key law governing research on electronic health record data is the HIPAA Privacy Rule. It allows the electronic health records of deceased subjects to be used for research (section 164.512(i)(1)(iii)).^[23]

References

[1] Office for Civil Rights (OCR) (2012-09-07). "Methods for De-identification of PHI". HHS.gov. Retrieved 2020-11-08.
[2] Sweeney, L. (2000). "Simple Demographics Often Identify People Uniquely". Data Privacy Working Paper. 3.
[3] de Montjoye, Y.-A. (2013). "Unique in the crowd: The privacy bounds of human mobility". Scientific Reports. 3: 1376. Bibcode:2013NatSR...3E1376D. doi:10.1038/srep01376. PMC 3607247. PMID 23524645.
[4] de Montjoye, Y.-A.; Radaelli, L.; Singh, V. K.; Pentland, A. S. (29 January 2015). "Unique in the shopping mall: On the reidentifiability of credit card metadata". Science. 347 (6221): 536–539. Bibcode:2015Sci...347..536D. doi:10.1126/science.1256297. hdl:1721.1/96321. PMID 25635097.
[5] Narayanan, A. (2006). "How to break anonymity of the netflix prize dataset". arXiv:cs/0610105.
[6] El Emam, Khaled (2011). "A Systematic Review of Re-Identification Attacks on Health Data". PLOS ONE. 6 (12): e28071. Bibcode:2011PLoSO...628071E. doi:10.1371/journal.pone.0028071. PMC 3229505. PMID 22164229.
[7] Garfinkel, Simson. De-identification of personal information : recommendation for transitioning the use of cryptographic algorithms and key lengths. OCLC 933741839.
[8] Ribaric, Slobodan; Ariyaeeinia, Aladdin; Pavesic, Nikola (September 2016). "De-identification for privacy protection in multimedia content: A survey". Signal Processing: Image Communication. 47: 131–151. doi:10.1016/j.image.2016.05.020. hdl:2299/19652.
[9] Bhaskaran, Vivek (2023-06-08). "Survey Research: Definition, Examples and Methods". QuestionPro. Retrieved 2023-12-17.
[10] Godard, B. A.; Schmidtke, J. R.; Cassiman, J. J.; Aymé, S. G. N. (2003). "Data storage and DNA banking for biomedical research: Informed consent, confidentiality, quality issues, ownership, return of benefits. A professional perspective". European Journal of Human Genetics. 11: S88–122. doi:10.1038/sj.ejhg.5201114. PMID 14718939.
[11] Fullerton, S. M.; Anderson, N. R.; Guzauskas, G.; Freeman, D.; Fryer-Edwards, K. (2010). "Meeting the Governance Challenges of Next-Generation Biorepository Research". Science Translational Medicine. 2 (15): 15cm3. doi:10.1126/scitranslmed.3000361. PMC 3038212. PMID 20371468.
[12] McMurry, AJ; Gilbert, CA; Reis, BY; Chueh, HC; Kohane, IS; Mandl, KD (2007). "A self-scaling, distributed information architecture for public health, research, and clinical care". J Am Med Inform Assoc. 14 (4): 527–33. doi:10.1197/jamia.M2371. PMC 2244902. PMID 17460129.
[13] "Data de-identification". The Abdul Latif Jameel Poverty Action Lab (J-PAL). Retrieved 2023-12-17.
[14] Ito, Koichi; Kogure, Jun; Shimoyama, Takeshi; Tsuda, Hiroshi (2016). "De-identification and Encryption Technologies to Protect Personal Information" (PDF). Fujitsu Scientific and Technical Journal. 52 (3): 28–36.
[15] Nicholson, S.; Smith, C. A. (2005). "Using lessons from health care to protect the privacy of library users: Guidelines for the de-identification of library data based on HIPAA" (PDF). Proceedings of the American Society for Information Science and Technology. 42: n/a. doi:10.1002/meet.1450420106.
[16] Coop, Alex. "Sidewalk Labs decision to offload tough decisions on privacy to third party is wrong, says its former consultant". IT World Canada. Retrieved 27 June 2019.
[17] "Medical PDF De-identification: Ensuring Patient Privacy and Compliance in Document Management". 2024.
[18] McGuire, A. L.; Gibbs, R. A. (2006). "GENETICS: No Longer De-Identified". Science. 312 (5772): 370–371. doi:10.1126/science.1125339. PMID 16627725.
[19] Thorisson, G. A.; Muilu, J.; Brookes, A. J. (2009). "Genotype–phenotype databases: Challenges and solutions for the post-genomic era". Nature Reviews Genetics. 10 (1): 9–18. doi:10.1038/nrg2483. hdl:2381/4584. PMID 19065136. S2CID 5964522.
[20] Homer, N.; Szelinger, S.; Redman, M.; Duggan, D.; Tembe, W.; Muehling, J.; Pearson, J. V.; Stephan, D. A.; Nelson, S. F.; Craig, D. W. (2008). Visscher, Peter M. (ed.). "Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays". PLOS Genetics. 4 (8): e1000167. doi:10.1371/journal.pgen.1000167. PMC 2516199. PMID 18769715.
[21] PCAST. "Report to the President - Big Data and Privacy: A technological perspective" (PDF). Office of Science and Technology Policy. Retrieved 28 March 2016 – via National Archives.
[22] "De-Identification 201". Privacy Analytics. 2015.
[23] 45 CFR 164.512

External links

Simson L. Garfinkel (2015-12-16). "NISTIR 8053, De-Identification of Personal Information" (PDF). NIST. Retrieved 2016-01-03.
A training series Archived 2015-11-13 at the Wayback Machine on United States government de-identification standards
Guidance Regarding Methods for De-identification of Protected Health Information Archived 2015-12-10 at the Wayback Machine
Ohm, Paul (2010). "Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization" (PDF). UCLA Law Review. 57: 1701–77.
Padilla-López, José Ramón; Chaaraoui, Alexandros Andre; Flórez-Revuelta, Francisco (June 2015). "Visual privacy protection methods: A survey" (PDF). Expert Systems with Applications. 42 (9): 4177–4195. doi:10.1016/j.eswa.2015.01.041. hdl:10045/44523. S2CID 6794899.
Chaaraoui, Alexandros; Padilla-López, José; Ferrández-Pastor, Francisco; Nieto-Hidalgo, Mario; Flórez-Revuelta, Francisco (20 May 2014). "A Vision-Based System for Intelligent Monitoring: Human Behaviour Analysis and Privacy by Context". Sensors. 14 (5): 8895–8925. Bibcode:2014Senso..14.8895C. doi:10.3390/s140508895. PMC 4063058. PMID 24854209.
