Deduce 3.0.0 is out! It is way more accurate, and faster too. It's fully backward compatible, but some functionality is scheduled for removal, read more about it here: docs/migrating-to-v3
- ✨ Remove sensitive information from clinical text written in Dutch
- 🔍 Rule based logic for detecting e.g. names, locations, institutions, identifiers, phone numbers
- 📐 Useful out of the box, but customization higly recommended
- 🌱 Originally validated in Menger et al. (2017), but further optimized since
❗ Deduce is useful out of the box, but please validate and customize on your own data before using it in a critical environment. Remember that de-identification is almost never perfect, and that clinical text often contains other specific details that can link it to a specific person. Be aware that de-identification should primarily be viewed as a way to mitigate risk of identification, rather than a way to obtain anonymous data.
Currently, deduce
can remove the following types of Protected Health Information (PHI):
- 👤 person names, including prefixes and initials
- 🌎 geographical locations smaller than a country
- 🏥 names of hospitals and healthcare institutions
- 📆 dates (combinations of day, month and year)
- 🎂 ages
- 🔢 BSN numbers
- 🔢 identifiers (7 digits without a specific format, e.g. patient identifiers, AGB, BIG)
- ☎️ phone numbers
- 📧 e-mail addresses
- 🔗 URLs
If you use deduce
, please cite the following paper:
pip install deduce
The basic way to use deduce
, is to pass text to the deidentify
method of a Deduce
object:
from deduce import Deduce
deduce = Deduce()
text = (
"betreft: Jan Jansen, bsn 111222333, patnr 000334433. De patient J. Jansen is 64 jaar oud en woonachtig in "
"Utrecht. Hij werd op 10 oktober 2018 door arts Peter de Visser ontslagen van de kliniek van het UMCU. "
"Voor nazorg kan hij worden bereikt via [email protected] of (06)12345678."
)
doc = deduce.deidentify(text)
The output is available in the Document
object:
from pprint import pprint
pprint(doc.annotations)
AnnotationSet({
Annotation(text="(06)12345678", start_char=272, end_char=284, tag="telefoonnummer"),
Annotation(text="111222333", start_char=25, end_char=34, tag="bsn"),
Annotation(text="Peter de Visser", start_char=153, end_char=168, tag="persoon"),
Annotation(text="[email protected]", start_char=247, end_char=268, tag="email"),
Annotation(text="patient J. Jansen", start_char=56, end_char=73, tag="patient"),
Annotation(text="Jan Jansen", start_char=9, end_char=19, tag="patient"),
Annotation(text="10 oktober 2018", start_char=127, end_char=142, tag="datum"),
Annotation(text="64", start_char=77, end_char=79, tag="leeftijd"),
Annotation(text="000334433", start_char=42, end_char=51, tag="id"),
Annotation(text="Utrecht", start_char=106, end_char=113, tag="locatie"),
Annotation(text="UMCU", start_char=202, end_char=206, tag="instelling"),
})
print(doc.deidentified_text)
"""betreft: [PERSOON-1], bsn [BSN-1], patnr [ID-1]. De [PERSOON-1] is [LEEFTIJD-1] jaar oud en woonachtig in
[LOCATIE-1]. Hij werd op [DATUM-1] door arts [PERSOON-2] ontslagen van de kliniek van het [INSTELLING-1].
Voor nazorg kan hij worden bereikt via [EMAIL-1] of [TELEFOONNUMMER-1]."""
Additionally, if the names of the patient are known, they may be added as metadata
, where they will be picked up by deduce
:
from deduce.person import Person
patient = Person(first_names=["Jan"], initials="JJ", surname="Jansen")
doc = deduce.deidentify(text, metadata={'patient': patient})
print (doc.deidentified_text)
"""betreft: [PATIENT], bsn [BSN-1], patnr [ID-1]. De [PATIENT] is [LEEFTIJD-1] jaar oud en woonachtig in
[LOCATIE-1]. Hij werd op [DATUM-1] door arts [PERSOON-2] ontslagen van de kliniek van het [INSTELLING-1].
Voor nazorg kan hij worden bereikt via [EMAIL-1] of [TELEFOONNUMMER-1]."""
As you can see, adding known names keeps references to [PATIENT]
in text. It also increases recall, as not all known names are contained in the lookup lists.
For most cases the latest version is suitable, but some specific milestones are:
3.0.0
- Many optimizations in accuracy, smaller refactors, further speedups2.0.0
- Major refactor, with speedups, many new options for customizing, functionally very similar to original1.0.8
- Small bugfixes compared to original release1.0.1
- Original release with Menger et al. (2017)
Detailed versioning information is accessible in the changelog.
All documentation, including a more extensive tutorial on using, configuring and modifying deduce
, and its API, is available at: docs/tutorial
For setting up the dev environment and contributing guidelines, see: docs/contributing
- Vincent Menger - Initial work
- Jonathan de Bruin - Code review
- Pablo Mosteiro - Bug fixes, structured annotations
This project is licensed under the GNU General Public License v3.0 - see the LICENSE.md file for details