MEDINA: Medical Information Anonymization
Presentation
MEDINA (Medical Information Anonymization) is a Natural Language Processing (NLP) tool designed to de-identify personal data in clinical records written in French and stored as raw text files. The tool was developed at LIMSI-CNRS (UPR3251) within the Akenaton project: Automated Knowledge Extraction from medical records iN Association with a Telecardiology Observation Network (ANR-07-TecSan-001).
The tool relies on rules (syntactic patterns implemented as regular expressions) and gazetteers (last names, first names, cities, etc.). The currently available version includes freely usable gazetteers (taken from the abu.cnam.fr/DICO/ website or built specifically for this tool) and a set of rules defined to process a corpus of 27,900 cardiology clinical records.
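To make the combination of rules and gazetteers concrete, here is a minimal sketch in Python. It is not MEDINA's Perl code: the tiny gazetteers, the regular expressions and the function name `tag_record` are all assumptions chosen for illustration, and the real rule set is far richer.

```python
import re

# Hypothetical miniature gazetteers; the real lists are far larger.
LAST_NAMES = {"Dupont", "Bernard", "Martin"}
FIRST_NAMES = {"Marie", "Camille", "Jean"}
CITIES = {"Paris", "Lyon", "Orsay"}

MONTHS = ("janvier|février|mars|avril|mai|juin|juillet|août|"
          "septembre|octobre|novembre|décembre")

# Illustrative rule 1: numeric dates (19/01/1981) or textual dates (5 janvier 2003).
DATE_RE = re.compile(
    rf"\b(\d{{1,2}}/\d{{1,2}}/\d{{4}}|\d{{1,2}}(?:er)? (?:{MONTHS}) \d{{4}})\b"
)
# Illustrative rule 2: a civility title followed by two capitalized tokens,
# read as "last name, first name" (the order used in the example record).
NAME_RE = re.compile(r"\b(Madame|Monsieur|Mme|M\.|Dr) ([A-ZÉ][\w-]+) ([A-ZÉ][\w-]+)")
# Illustrative rule 3: the preposition "à" followed by a city from the gazetteer.
CITY_RE = re.compile(r"\bà (" + "|".join(sorted(CITIES)) + r")\b")


def tag_record(text):
    """Wrap names, dates and cities found by the rules in category tags."""

    def tag_name(match):
        title, last, first = match.groups()
        # The pattern proposes a candidate; the gazetteers confirm it.
        if last in LAST_NAMES and first in FIRST_NAMES:
            return (f"{title} <last-name>{last}</last-name> "
                    f"<first-name>{first}</first-name>")
        return match.group(0)

    text = NAME_RE.sub(tag_name, text)
    text = DATE_RE.sub(r"<date>\1</date>", text)
    text = CITY_RE.sub(r"à <city>\1</city>", text)
    return text
```

In this sketch the pattern only proposes candidates and the gazetteer lookup decides, so an unknown capitalized word after "Madame" is left untagged. That favours precision over recall, which is a design choice of the sketch and not a claim about MEDINA's behaviour.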
In its current version, MEDINA handles information from the following categories (a configuration file specifies which categories to process):
- last name and first name (without any distinction between patient, family member, or clinical team member);
- address, zip code, city and hospital name;
- age, date, phone number, social security number, serial number;
- information on cardiological devices (pacemaker trademark and model).
Keep in mind that this tool only assists de-identification: some personal data may remain unprocessed at the end of the process (in our latest experiments, 83% of personal data items were processed). A human curator should check the results and de-identify whatever was missed.
Usage
MEDINA consists of a few Perl scripts and runs from the command line (tested on Mac OS X and Unix). It works in two main steps:
- identification and tagging of data from each category;
- de-identification of the previously tagged data, using one of two methods:
  - replacing the data with an SGML tag naming its category (hypernym): «M. <first-name /> <last-name /> est revenu dans le service ce <date /> pour un suivi...»;
  - replacing last names and first names with pseudonyms (drawn from the most common last names and first names in France) and shifting all dates in a record back in time. The same number of days is subtracted from every date in a record; this number is chosen at random and differs from one document to the next. These post-processing steps keep the text plausible while preserving patient anonymity.
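The two de-identification methods described above can be sketched as follows. This is an illustration in Python, not the tool's Perl implementation: the pseudonym pools, the function names and the choice to reuse one pseudonym per original name within a document are assumptions.

```python
import random
import re

# Hypothetical pseudonym pools; the page only says that pseudonyms come from
# the most common French last names and first names.
PSEUDO_LAST = ["Martin", "Bernard", "Thomas", "Petit"]
PSEUDO_FIRST = ["Camille", "Dominique", "Claude"]

NAME_TAG_RE = re.compile(r"<(last-name|first-name)>(.*?)</\1>")
ANY_TAG_RE = re.compile(r"<([a-z-]+)>.*?</\1>")


def hyperonymize(tagged):
    """Replace every tagged span with an empty tag naming its category."""
    return ANY_TAG_RE.sub(r"<\1 />", tagged)


def pseudonymize(tagged, rng):
    """Replace tagged names with pseudonyms and drop the name tags.

    Assumption: the same original name always receives the same pseudonym
    within one document, so the text stays internally coherent.
    """
    mapping = {}

    def repl(match):
        kind, original = match.group(1), match.group(2)
        key = (kind, original)
        if key not in mapping:
            pool = PSEUDO_LAST if kind == "last-name" else PSEUDO_FIRST
            mapping[key] = rng.choice(pool)
        return mapping[key]

    return NAME_TAG_RE.sub(repl, tagged)


if __name__ == "__main__":
    record = ("M. <first-name>Jean</first-name> <last-name>Dupont</last-name> "
              "est revenu dans le service ce <date>5 janvier 2003</date>")
    print(hyperonymize(record))
    print(pseudonymize(record, random.Random()))
```

Passing the random generator as a parameter keeps the sketch reproducible in tests while still allowing a fresh draw for each document.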
Documentation (in French)
- Guidelines: corpus de-identification (guidelines used by the human annotators to produce the gold standard), updated 09/24/2013;
- User guide: MEDINA (quick user guide), updated 01/11/2014.
Download
MEDINA is freely available once a licence has been signed. First, contact us to obtain the licence form to fill out. Second, we ask our partnership service to validate the licence. Third, you sign it.
Example
Each step is followed by the text it produces. Modifications made at each step are shown in red.
- Original text
  Cher confrère, merci de nous avoir adressé Madame Dupont Marie né(e) le 19/01/1981 à Paris pour réalisation d'une scintigraphie myocardique au Mibi, examen le 5 janvier 2003.
- Information identification (command: perl 1k_balisage.pl -r directory/ -e txt)
  Cher confrère, merci de nous avoir adressé Madame <last-name>Dupont</last-name> <first-name>Marie</first-name> né(e) le <date>19/01/1981</date> à <city>Paris</city> pour réalisation d'une scintigraphie myocardique au Mibi, examen le <date>5 janvier 2003</date>.
- Date shifting, or predating (command: perl 2_antidatation.pl -r directory/ -n 941; subtracts 941 days from each date)
  Cher confrère, merci de nous avoir adressé Madame <last-name>Dupont</last-name> <first-name>Marie</first-name> né(e) le <date>23/06/1978</date> à <city>Paris</city> pour réalisation d'une scintigraphie myocardique au Mibi, examen le <date>8 juin 2000</date>.
- Pseudonymization (command: perl 4_pseudonymes.pl -r directory/ -e dat)
  Cher confrère, merci de nous avoir adressé Madame Bernard Camille né(e) le <date>23/06/1978</date> à <city>Paris</city> pour réalisation d'une scintigraphie myocardique au Mibi, examen le <date>8 juin 2000</date>.
- Hyperonymization (command: perl 5_hyperonymes.pl -r directory/ -e pse)
  Cher confrère, merci de nous avoir adressé Madame Bernard Camille né(e) le 23/06/1978 à <city/> pour réalisation d'une scintigraphie myocardique au Mibi, examen le 8 juin 2000.
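The date arithmetic of the example (941 days subtracted from each date) can be checked with a short Python sketch. The function names and the two supported date formats are assumptions made for illustration; the Perl script 2_antidatation.pl may recognise other formats and behave differently on edge cases.

```python
import re
from datetime import date, timedelta

MONTHS = ["janvier", "février", "mars", "avril", "mai", "juin", "juillet",
          "août", "septembre", "octobre", "novembre", "décembre"]

DATE_TAG_RE = re.compile(r"<date>(.*?)</date>")
NUMERIC_RE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")
TEXTUAL_RE = re.compile(r"(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})")


def shift_date(text, days):
    """Move one date back by `days`, keeping its original surface format."""
    match = NUMERIC_RE.fullmatch(text)
    if match:
        day, month, year = map(int, match.groups())
        shifted = date(year, month, day) - timedelta(days=days)
        return shifted.strftime("%d/%m/%Y")
    match = TEXTUAL_RE.fullmatch(text)
    if match:
        month = MONTHS.index(match.group(2)) + 1
        shifted = (date(int(match.group(3)), month, int(match.group(1)))
                   - timedelta(days=days))
        return f"{shifted.day} {MONTHS[shifted.month - 1]} {shifted.year}"
    return text  # unrecognised format: left untouched in this sketch


def backdate(tagged, days):
    """Apply the same shift to every <date> span of a tagged record."""
    return DATE_TAG_RE.sub(
        lambda m: f"<date>{shift_date(m.group(1), days)}</date>", tagged
    )
```

Because the same offset is applied to every date in a record, the intervals between events (here, the patient's age at the time of the exam) are preserved even though no absolute date remains true.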
Evaluation
The following tables show results achieved by MEDINA on a corpus of 62 cardiology clinical records.
The first table gives overall results. The confidence interval was computed on the F-measure using a Monte Carlo simulation; it estimates the results that would be achieved on a corpus of ten million clinical records (assuming a similar distribution of properties at large scale).
True positives | False positives | False negatives | Recall | Precision | F-measure | Confidence interval |
548 | 87 | 110 | 0.8328 | 0.8630 | 0.8476 | [0.8266; 0.8687] |
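The page does not describe the Monte Carlo simulation in detail. As a rough illustration of how an interval on the F-measure can be obtained by resampling, here is a percentile bootstrap over the 745 individual outcomes (548 true positives, 87 false positives, 110 false negatives). The bootstrap is a swapped-in technique chosen for this sketch, and the number of rounds, the seed and the function names are assumptions; it is not necessarily the procedure used for the table.

```python
import random


def f_measure(tp, fp, fn):
    """Balanced F-measure from raw counts (0.0 when there is no true positive)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_interval(tp, fp, fn, rounds=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval on the F-measure.

    Each round redraws, with replacement, as many outcomes as were observed
    and recomputes the F-measure on the redrawn sample.
    """
    rng = random.Random(seed)
    outcomes = ["tp"] * tp + ["fp"] * fp + ["fn"] * fn
    scores = []
    for _ in range(rounds):
        sample = rng.choices(outcomes, k=len(outcomes))
        scores.append(
            f_measure(sample.count("tp"), sample.count("fp"), sample.count("fn"))
        )
    scores.sort()
    lower = scores[int(rounds * alpha / 2)]
    upper = scores[int(rounds * (1 - alpha / 2)) - 1]
    return lower, upper


if __name__ == "__main__":
    print(bootstrap_interval(548, 87, 110))
```

The exact bounds depend on the seed and the number of rounds, so they should be compared with the published interval only in order of magnitude.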
The following table gives results for each category. Results for sparsely represented categories are, however, hard to interpret.
Category | True positives | False positives | False negatives | Recall | Precision | F-measure |
Date | 213 | 13 | 29 | 0.880 | 0.942 | 0.910 |
Last name | 186 | 20 | 19 | 0.907 | 0.903 | 0.905 |
First name | 101 | 29 | 8 | 0.927 | 0.777 | 0.845 |
Hospital | 16 | 16 | 27 | 0.372 | 0.500 | 0.427 |
City | 11 | 5 | 11 | 0.500 | 0.688 | 0.579 |
Zip code | 8 | 0 | 0 | 1.000 | 1.000 | 1.000 |
Address | 1 | 2 | 7 | 0.125 | 0.333 | 0.182 |
Phone | 8 | 0 | 0 | 1.000 | 1.000 | 1.000 |
Device | 3 | 2 | 7 | 0.300 | 0.600 | 0.400 |
Serial number | 1 | 0 | 2 | 0.333 | 1.000 | 0.500 |
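The recall, precision and F-measure columns follow from the three count columns, and the per-category counts add up to the overall counts of the first table. The short Python sketch below recomputes them; the function and variable names are illustrative only.

```python
# Counts copied from the per-category table: (true pos., false pos., false neg.)
CATEGORY_COUNTS = {
    "Date": (213, 13, 29),
    "Last name": (186, 20, 19),
    "First name": (101, 29, 8),
    "Hospital": (16, 16, 27),
    "City": (11, 5, 11),
    "Zip code": (8, 0, 0),
    "Address": (1, 2, 7),
    "Phone": (8, 0, 0),
    "Device": (3, 2, 7),
    "Serial number": (1, 0, 2),
}


def scores(tp, fp, fn):
    """Return (recall, precision, F-measure) from raw counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_measure


def overall_counts(table):
    """Sum the counts over all categories (micro-averaging)."""
    return tuple(sum(row[i] for row in table.values()) for i in range(3))


if __name__ == "__main__":
    for name, counts in CATEGORY_COUNTS.items():
        recall, precision, f_measure = scores(*counts)
        print(f"{name:14s} R={recall:.3f} P={precision:.3f} F={f_measure:.3f}")
    print("Overall", overall_counts(CATEGORY_COUNTS))
```

Since the overall figures are computed on summed counts, frequent categories such as dates and names dominate them, which is why the low scores on hospitals, addresses or devices barely affect the global F-measure.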
References
Please cite:
- Grouin C. Anonymisation de documents cliniques : performances et limites des méthodes symboliques et par apprentissage statistique. Thèse de Doctorat de l'Université Pierre et Marie Curie (Paris VI), spécialité « informatique biomédicale ». oai:tel.archives-ouvertes.fr:tel-00848672
→ this work presents the de-identification experiments (rule-based system, machine-learning approach, and their combination) we carried out for French on cardiology clinical records. See chapter 5 for a description of how MEDINA works (pp. 146-52), appendix B (pp. 213-17) for the user guide, and appendix A (pp. 207-11) for the guidelines.
@PHDTHESIS{grouin2013phd,
author = {Cyril Grouin},
title = {Anonymisation de documents cliniques~: performances et limites des m\'ethodes symboliques et par apprentissage statistique},
school = {Universit\'e Pierre et Marie Curie},
year = {2013},
type = {Th\`ese de Doctorat},
address = {Paris, France},
month = {Juin},
url = {http://tel.archives-ouvertes.fr/tel-00848672}
}
Further studies:
- 2009
  - Grouin C, Rosier A, Dameron O, Zweigenbaum P. Testing tactics to localize de-identification. Stud Health Technol Inform, Proc of MIE. 2009;150:735-9. Sarajevo, Bosnia and Herzegovina. doi: 10.3233/978-1-60750-044-5-735. PubMed ID: 19745408.
    → first experiments (2009) done using the rule-based system: (1) exact match between the hospital patient information system and the content of clinical records and (2) MEDINA processing.
  - Grouin C, Rosier A, Dameron O, Zweigenbaum P. Une procédure d'anonymisation à deux niveaux pour créer un corpus de comptes rendus hospitaliers. In: Fieschi M, Staccini P, Bouhaddou O, Lovis C (éditeurs). Risques, technologies de l'information pour les pratiques médicales. Actes des JFIM. vol. XVII. Springer-Verlag; 2009. Nice, France. doi: 10.1007/978-2-287-99305-3_3.
    → similar to the previous paper.
- 2011
  - Grouin C, Zweigenbaum P. Une approche à plusieurs étapes pour anonymiser des documents médicaux. In: RSTI-RIA, Intelligence Artificielle et santé "Vers quelles applications en médecine ?". 25(4):525-549. 2011. Hermès-Lavoisier. doi: 10.3166/RIA.25.525-549.
    → more detailed paper.
- 2013
  - Grouin C. Guide d'annotation. Anonymisation de comptes rendus cliniques. Notes et documents internes LIMSI no 2013-16. Septembre 2013. 8 pages.
    → guidelines we used to produce the gold standard corpus.
  - Grouin C. Perspectives de diffusion et de valorisation d'un logiciel d'anonymisation automatique de documents cliniques. Mémoire de recherche pour l'obtention du Diplôme Universitaire de « Génie Biologique et Médical » de l'Université Pierre et Marie Curie (Paris VI), spécialité "Valorisation de la Recherche Appliquée et de l'Innovation Biomédicale".
    → technology-transfer study on distributing the MEDINA tool to the scientific community.
  - Grouin C, Zweigenbaum P. Automatic De-Identification of French Clinical Records: Comparison of Rule-Based and Machine-Learning Approaches. Stud Health Technol Inform, Proc of MEDINFO. 2013;192(Part 1):476-80. Copenhagen, Denmark. IMIA and IOS Press. doi: 10.3233/978-1-61499-289-9-476.
    → comparison of de-identification performed using either rule-based or machine-learning approaches on two corpora: a cardiology clinical records corpus (for which the tool was designed) and a small OCRized foetopathology corpus.
  - Névéol A, Grouin C, Darmoni S, Zweigenbaum P. Désidentification d'un corpus clinique pour le traitement automatique du français. In: Session francophone de MedInfo, 2013. Copenhagen, Denmark (20/08/2013).
    → use of MEDINA on a new French corpus.
- 2014
  - Grouin C, Névéol A. De-identification of clinical notes in French: Towards a protocol for reference corpus development. J Biomed Inform, 2014. In press. doi: 10.1016/j.jbi.2013.12.014. PubMed ID: 24380818.
    → presentation of the experiments we made to define a methodological protocol for de-identification of French clinical records.
Last modified: Fri Oct 6 17:40:57 CEST 2017 http://medina.limsi.fr/