MEDINA – Medical Information Anonymization

Presentation

MEDINA (Medical Information Anonymization) is a Natural Language Processing (NLP) tool designed to de-identify personal data in clinical records written in French and supplied as raw text files. The tool was developed at LIMSI-CNRS (UPR3251) within the Akenaton project (Automated Knowledge Extraction from medical records iN Association with a Telecardiology Observation Network, ANR-07-TecSan-001).

The tool relies on rules (syntactic patterns implemented as regular expressions) and gazetteers (last names, first names, cities, etc.). The currently available version includes freely usable gazetteers (taken from the abu.cnam.fr/DICO/ website or built specifically for this tool) and a set of rules defined to process a corpus of 27,900 cardiology clinical records.
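To illustrate how rules and gazetteers can work together, here is a minimal Python sketch. It is not MEDINA's Perl code: the tiny word lists, the three patterns and the `tag` function are all assumptions made for this example. The idea shown is that a regular expression proposes candidates and a gazetteer lookup confirms them.

```python
import re

# Hypothetical mini-gazetteers; real lists would be far larger.
LAST_NAMES = {"Dupont", "Bernard", "Martin"}
FIRST_NAMES = {"Marie", "Camille", "Jean"}
CITIES = {"Paris", "Lyon", "Nice"}

# Illustrative rules only (assumptions, not MEDINA's actual patterns).
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
NAME_RE = re.compile(r"\b(Madame|Monsieur|Mme|M\.)\s+([A-ZÉ][\w-]+)\s+([A-ZÉ][\w-]+)")
CITY_RE = re.compile(r"\bà\s+([A-ZÉ][\w-]+)")


def tag(text):
    """Wrap names, cities and numeric dates in category tags."""

    def name_sub(m):
        title, w1, w2 = m.groups()
        # The pattern proposes a candidate; the gazetteers confirm it.
        if w1 in LAST_NAMES and w2 in FIRST_NAMES:
            return (f"{title} <last-name>{w1}</last-name> "
                    f"<first-name>{w2}</first-name>")
        return m.group(0)

    def city_sub(m):
        word = m.group(1)
        if word in CITIES:
            return f"à <city>{word}</city>"
        return m.group(0)

    text = NAME_RE.sub(name_sub, text)
    text = CITY_RE.sub(city_sub, text)
    # Numeric dates need no gazetteer: the pattern alone decides.
    return DATE_RE.sub(lambda m: f"<date>{m.group(0)}</date>", text)


print(tag("Madame Dupont Marie né(e) le 19/01/1981 à Paris"))
```

A capitalised word that is absent from the gazetteers is left untagged in this sketch, which is one way such a system can miss personal data.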

In its current version, MEDINA handles information from the following categories (a configuration file lets you specify which categories to process):

last name and first name (without any distinction between patient, family member, or clinical team member);
address, zip code, city and hospital name;
age, date, phone number, social security number, serial number;
information on cardiological devices (pacemaker trademark and model).

Keep in mind that this tool assists de-identification and that some personal data may remain unprocessed at the end (in our latest experiments, 83% of personal data were processed). A human curator should check the results and complete the missing de-identifications.

Contact:

Utilisation

MEDINA consists of a few scripts written in Perl (Practical Extraction and Report Language) and runs from the command line (tested on Mac OS X and Unix). Processing follows two main steps:

identification and tagging of data from each category;
de-identification of previously tagged data:

either concealing data with an SGML tag naming its category (hyperonym): «M. <first-name /> <last-name /> est revenu dans le service ce <date /> pour un suivi...»;
or replacing last names and first names with pseudonyms (drawn from the most common last names and first names in France) and predating all dates in a record (the same number of days is subtracted from every date in a record; this number is chosen at random and differs from one document to the next). These post-processing steps keep the text plausible while preserving patient anonymity.
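The two concealment options can be sketched in Python as follows. This is an illustration under stated assumptions, not MEDINA's Perl implementation: the pseudonym pools, the function names, and the round-robin choice of pseudonyms are all hypothetical. How MEDINA actually selects a pseudonym is not described here.

```python
import re

# Hypothetical pseudonym pools (assumption for this example).
PSEUDO_LAST = ["Bernard", "Martin", "Petit"]
PSEUDO_FIRST = ["Camille", "Claude", "Dominique"]

# Matches <category>value</category> spans produced by the tagging step.
TAGGED = re.compile(r"<(?P<cat>[a-z-]+)>(?P<value>.*?)</(?P=cat)>")


def hyperonymize(text):
    """Replace every tagged span with an empty tag naming its category."""
    return TAGGED.sub(lambda m: f"<{m.group('cat')} />", text)


def pseudonymize(text):
    """Replace tagged names with pseudonyms; other tags are left in place.

    Within one call, the same original value always gets the same pseudonym.
    """
    pools = {"last-name": PSEUDO_LAST, "first-name": PSEUDO_FIRST}
    mapping = {}

    def sub(m):
        cat, value = m.group("cat"), m.group("value")
        if cat not in pools:
            return m.group(0)
        key = (cat, value)
        if key not in mapping:
            # Round-robin choice: an assumption made for reproducibility.
            used = sum(1 for c, _ in mapping if c == cat)
            mapping[key] = pools[cat][used % len(pools[cat])]
        return mapping[key]

    return TAGGED.sub(sub, text)


tagged = ("M. <first-name>Jean</first-name> <last-name>Dupont</last-name> "
          "est revenu dans le service ce <date>19/01/1981</date>")
print(hyperonymize(tagged))
print(pseudonymize(tagged))
```

Keeping a per-record mapping means a name mentioned several times in one record stays consistent after substitution, which helps the text remain readable.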

Documentation (in French)

Guidelines: corpus de-identification (guidelines used by the human annotators to produce the gold standard), update 09/24/2013;
User guide: MEDINA (quick user guide), update 01/11/2014.

Download

MEDINA is freely available once a licence has been signed. First, contact us to obtain the licence form to fill out. Second, we ask our partnership office to validate the licence. Third, you sign it.

Example

Modifications made at each step are shown in red.

Original text

Cher confrère, merci de nous avoir adressé Madame Dupont Marie né(e) le 19/01/1981 à Paris pour réalisation d'une scintigraphie myocardique au Mibi, examen le 5 janvier 2003.

Information identification (command: perl 1k_balisage.pl -r directory/ -e txt)

Cher confrère, merci de nous avoir adressé Madame <last-name>Dupont</last-name> <first-name>Marie</first-name> né(e) le <date>19/01/1981</date> à <city>Paris</city> pour réalisation d'une scintigraphie myocardique au Mibi, examen le <date>5 janvier 2003</date>.

Predating (command: perl 2_antidatation.pl -r directory/ -n 941; subtracts 941 days from each date)

Cher confrère, merci de nous avoir adressé Madame <last-name>Dupont</last-name> <first-name>Marie</first-name> né(e) le <date>23/06/1978</date> à <city>Paris</city> pour réalisation d'une scintigraphie myocardique au Mibi, examen le <date>8 juin 2000</date>.
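The date arithmetic of this step can be checked with a short Python sketch. It is not the code of 2_antidatation.pl: the function names and the two supported date formats (numeric, and day plus French month name) are assumptions chosen to cover the example above.

```python
import re
from datetime import date, timedelta

MONTHS = ["janvier", "février", "mars", "avril", "mai", "juin", "juillet",
          "août", "septembre", "octobre", "novembre", "décembre"]

NUMERIC = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")
LITERAL = re.compile(r"^(\d{1,2}) (\w+) (\d{4})$")


def shift(value, days):
    """Subtract `days` from one date string, keeping its original format."""
    m = NUMERIC.match(value)
    if m:
        d, mo, y = map(int, m.groups())
        new = date(y, mo, d) - timedelta(days=days)
        return f"{new.day:02d}/{new.month:02d}/{new.year}"
    m = LITERAL.match(value)
    if m and m.group(2).lower() in MONTHS:
        d, y = int(m.group(1)), int(m.group(3))
        mo = MONTHS.index(m.group(2).lower()) + 1
        new = date(y, mo, d) - timedelta(days=days)
        return f"{new.day} {MONTHS[new.month - 1]} {new.year}"
    return value  # unknown format: left untouched in this sketch


def predate(text, days):
    """Apply the same shift to every <date>...</date> span of a record."""
    return re.sub(r"<date>(.*?)</date>",
                  lambda m: f"<date>{shift(m.group(1), days)}</date>", text)


print(shift("19/01/1981", 941))      # 23/06/1978
print(shift("5 janvier 2003", 941))  # 8 juin 2000
```

Because every date in a record moves by the same offset, intervals between events (for instance the patient's age at examination) are preserved.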

Pseudonymization (command: perl 4_pseudonymes.pl -r directory/ -e dat)

Cher confrère, merci de nous avoir adressé Madame Bernard Camille né(e) le <date>23/06/1978</date> à <city>Paris</city> pour réalisation d'une scintigraphie myocardique au Mibi, examen le <date>8 juin 2000</date>.

Hyperonymization (command: perl 5_hyperonymes.pl -r directory/ -e pse)

Cher confrère, merci de nous avoir adressé Madame Bernard Camille né(e) le 23/06/1978 à <city/> pour réalisation d'une scintigraphie myocardique au Mibi, examen le 8 juin 2000.

Evaluation

The following tables show results achieved by MEDINA on a corpus of 62 cardiology clinical records.

The first table gives overall results. The confidence interval was computed on the F-measure using a Monte Carlo simulation; it estimates the results that would be achieved on a corpus of ten million clinical records (assuming a similar distribution of properties at large scale).

True positive	False positive	False negative	Recall	Precision	F-measure	Confidence interval
548	87	110	0.8328	0.8630	0.8476	[0.8266; 0.8687]
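The overall scores follow directly from the three counts, as the Python sketch below shows. The interval part is only an assumption: the exact Monte Carlo procedure used for MEDINA is not described here, so the sketch substitutes a plain bootstrap (resampling the 745 outcomes with replacement and taking percentiles of the F-measure). Its output is not expected to reproduce the published interval exactly.

```python
import random

TP, FP, FN = 548, 87, 110  # counts from the overall results


def scores(tp, fp, fn):
    """Return (recall, precision, F-measure) for the given counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


def bootstrap_interval(tp, fp, fn, runs=2000, seed=0):
    """95% percentile interval on the F-measure (assumed resampling scheme)."""
    rng = random.Random(seed)
    outcomes = ["tp"] * tp + ["fp"] * fp + ["fn"] * fn
    f_values = []
    for _ in range(runs):
        sample = rng.choices(outcomes, k=len(outcomes))
        f_values.append(scores(sample.count("tp"),
                               sample.count("fp"),
                               sample.count("fn"))[2])
    f_values.sort()
    return f_values[int(0.025 * runs)], f_values[int(0.975 * runs) - 1]


recall, precision, f_measure = scores(TP, FP, FN)
print(f"{recall:.4f} {precision:.4f} {f_measure:.4f}")  # 0.8328 0.8630 0.8476
low, high = bootstrap_interval(TP, FP, FN)
```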

The following table gives results for each category. Results for sparsely represented categories are hard to interpret, however.

Category	True positive	False positive	False negative	Recall	Precision	F-measure
Date	213	13	29	0.880	0.942	0.910
Last name	186	20	19	0.907	0.903	0.905
First name	101	29	8	0.927	0.777	0.845
Hospital	16	16	27	0.372	0.500	0.427
City	11	5	11	0.500	0.688	0.579
Zip code	8	0	0	1.000	1.000	1.000
Address	1	2	7	0.125	0.333	0.182
Phone	8	0	0	1.000	1.000	1.000
Device	3	2	7	0.300	0.600	0.400
Serial number	1	0	2	0.333	1.000	0.500

References

Please cite:

Grouin C. Anonymisation de documents cliniques : performances et limites des méthodes symboliques et par apprentissage statistique. Thèse de Doctorat de l'Université Pierre et Marie Curie (Paris VI), spécialité « informatique biomédicale ». oai:tel.archives-ouvertes.fr:tel-00848672
→ this work presents the de-identification experiments (rule-based system, machine-learning approach, and their combination) we carried out for French on cardiology clinical records. See chapter 5 for a description of how MEDINA works (pp. 146–52), appendix B (pp. 213–17) for the user guide, and appendix A (pp. 207–11) for the guidelines.

@PHDTHESIS{grouin2013phd,
  author = {Cyril Grouin},
  title = {Anonymisation de documents cliniques~: performances et limites des m\'ethodes symboliques et par apprentissage statistique},
  school = {Universit\'e Pierre et Marie Curie},
  year = {2013},
  type = {Th\`ese de Doctorat},
  address = {Paris, France},
  month = {Juin},
  url = {http://tel.archives-ouvertes.fr/tel-00848672}
}

Further studies:

2009

Grouin C, Rosier A, Dameron O, Zweigenbaum P. Testing tactics to localize de-identification. Stud Health Technol Inform – Proc of MIE. 2009;150:735–9. Sarajevo, Bosnia and Herzegovina. doi: 10.3233/978-1-60750-044-5-735. PubMed ID: 19745408.
→ first experiments (2009) done using the rule-based system: (1) exact match between the hospital patient information system and the content of clinical records and (2) MEDINA processing.
Grouin C, Rosier A, Dameron O, Zweigenbaum P. Une procédure d'anonymisation à deux niveaux pour créer un corpus de comptes rendus hospitaliers. In: Fieschi M, Staccini P, Bouhaddou O, Lovis C (éditeurs). Risques, technologies de l'information pour les pratiques médicales – Actes des JFIM. vol. XVII. Springer-Verlag ; 2009. Nice, France. doi: 10.1007/978-2-287-99305-3_3.
→ similar to the previous paper.

2011

Grouin C, Zweigenbaum P. Une approche à plusieurs étapes pour anonymiser des documents médicaux. In: RSTI-RIA, Intelligence Artificielle et santé "Vers quelles applications en médecine ?". 25(4):525–49. 2011. Hermès-Lavoisier. doi: 10.3166/RIA.25.525-549.
→ more detailed paper.

2013

Grouin C. Guide d'annotation. Anonymisation de comptes rendus cliniques. Notes et documents internes LIMSI nº 2013-16. Septembre 2013. 8 pages.
→ guidelines we used to produce the gold standard corpus.
Grouin C. Perspectives de diffusion et de valorisation d'un logiciel d'anonymisation automatique de documents cliniques. Mémoire de recherche pour l'obtention du Diplôme Universitaire de « Génie Biologique et Médical » de l'Université Pierre et Marie Curie (Paris VI), spécialité "Valorisation de la Recherche Appliquée et de l'Innovation Biomédicale".
→ valorization work to distribute the MEDINA tool to the scientific community.
Grouin C, Zweigenbaum P. Automatic De-Identification of French Clinical Records: Comparison of Rule-Based and Machine-Learning Approaches. Stud Health Technol Inform – Proc of MEDINFO, 2013;192(Part 1):476–80. Copenhagen, Denmark. IMIA and IOS Press. doi: 10.3233/978-1-61499-289-9-476.
→ comparison of de-identification work performed using either rule-based or machine-learning approaches on two corpora: a cardiology clinical records corpus (for which the tool was designed) and a small OCRized foetopathology corpus.
Névéol A, Grouin C, Darmoni S, Zweigenbaum P. Désidentification d'un corpus clinique pour le traitement automatique du français. In: Session francophone de MedInfo, 2013. Copenhagen, Denmark (20/08/2013).
→ use of MEDINA on a new French corpus.

2014

Grouin C, Névéol A. De-identification of clinical notes in French: Towards a protocol for reference corpus development. J Biomed Inform, 2014. In press. doi: 10.1016/j.jbi.2013.12.014. PubMed ID: 24380818
→ presentation of the experiments we made to define a methodological protocol for de-identification of French clinical records.

Last modified: Fri Oct 6 17:40:57 CEST 2017 — http://medina.limsi.fr/
