MEDINA – Medical Information Anonymization






MEDINA (Medical Information Anonymization) is a Natural Language Processing (NLP) tool designed to de-identify personal data in French clinical records supplied as raw text files. The tool was developed at LIMSI-CNRS (UPR3251) within the Akenaton project (Automated Knowledge Extraction from medical records iN Association with a Telecardiology Observation Network, ANR-07-TecSan-001).

The tool relies on rules (syntactic patterns implemented as regular expressions) and gazetteers (last names, first names, cities, etc.). The currently available version includes free-to-use gazetteers (taken from the web or built specifically for this tool) and a set of rules defined to process a corpus of 27,900 cardiology clinical records.
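To illustrate how rules and gazetteers can work together, here is a minimal Python sketch. It is not MEDINA's Perl code: the gazetteer entries, the two regular expressions, and the placeholder tags (`<LAST_NAME>`, `<FIRST_NAME>`, `<ZIP>`) are all hypothetical and chosen only for the example.

```python
import re

# Hypothetical gazetteers; MEDINA's real lists are not reproduced here.
FIRST_NAMES = {"jean", "marie", "pierre"}
LAST_NAMES = {"dupont", "martin"}

# Hypothetical rule: a civility title followed by a capitalized word
# introduces a last name, even if that name is absent from the gazetteer.
TITLE_RULE = re.compile(r"\b(M\.|Mme|Dr)\s+([A-Z][\w-]+)")
# Hypothetical rule: a five-digit sequence is treated as a zip code.
ZIP_RULE = re.compile(r"\b\d{5}\b")
# A token is a run of letters (no digits, no underscores).
TOKEN = re.compile(r"\b[^\W\d_]+\b")


def deidentify(text):
    """Replace personal data with category tags: rules first, then gazetteers."""
    # Rule pass: keep the title, replace the name it introduces.
    text = TITLE_RULE.sub(lambda m: m.group(1) + " <LAST_NAME>", text)
    text = ZIP_RULE.sub("<ZIP>", text)

    # Gazetteer pass: look up each remaining token, case-insensitively.
    def lookup(match):
        word = match.group(0)
        key = word.lower()
        if key in FIRST_NAMES:
            return "<FIRST_NAME>"
        if key in LAST_NAMES:
            return "<LAST_NAME>"
        return word

    return TOKEN.sub(lookup, text)


print(deidentify("Mme Durand et Marie Dupont, 75013 Paris."))
```

In this sketch the rule pass catches "Durand" through its context (the title "Mme") although it is not in the gazetteer, while the gazetteer pass catches "Marie" and "Dupont" without any contextual cue. "Paris" is left untouched because no city gazetteer is defined here.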

In its current version, MEDINA handles information from the following categories (a configuration file lets you specify which categories to process):

Keep in mind that this tool assists de-identification and that some personal data may remain unprocessed at the end of the run (in our latest experiments, 83% of personal data were processed). A human curator should check the results and complete the missing de-identifications.




MEDINA consists of a few scripts written in Perl (Practical Extraction and Report Language) and runs from the command line (tested on Mac OS X and Unix). It follows two main steps:


 Documentation (in French)



MEDINA is freely available once a licence has been signed. First, contact us to obtain the licence form to fill out. Second, we ask our partnership service to validate the licence. Third, you sign it.



Modifications made after each step are shown in red.



The following tables show results achieved by MEDINA on a corpus of 62 cardiology clinical records.

The first table gives overall results. The confidence interval was computed on the F-measure using a Monte Carlo simulation; it estimates the results that would be achieved on a corpus of ten million clinical records (assuming a similar distribution of properties at large scale).
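The exact simulation procedure is not described here. As one possible reading, the sketch below estimates an interval on the F-measure by bootstrap resampling of documents, which is a common Monte Carlo approach. The per-document counts in `DOCS`, the number of draws, the 95% level, and the function names are assumptions made for the example and are not MEDINA's data or method.

```python
import random


def f_measure(tp, fp, fn):
    """F-measure from pooled counts (0.0 when it is undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_interval(docs, draws=1000, level=0.95, seed=0):
    """Percentile interval on the F-measure, resampling documents with replacement."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    scores = []
    for _ in range(draws):
        sample = rng.choices(docs, k=len(docs))
        tp = sum(d[0] for d in sample)
        fp = sum(d[1] for d in sample)
        fn = sum(d[2] for d in sample)
        scores.append(f_measure(tp, fp, fn))
    scores.sort()
    tail = (1 - level) / 2
    low = scores[int(tail * draws)]
    high = scores[int((1 - tail) * draws) - 1]
    return low, high


# Hypothetical (true positive, false positive, false negative) counts per document.
DOCS = [(5, 1, 0), (3, 0, 1), (4, 1, 1), (6, 0, 0), (2, 1, 1), (7, 1, 0)]

low, high = bootstrap_interval(DOCS)
print(f"F-measure interval: [{low:.3f}, {high:.3f}]")
```

Resampling whole documents rather than individual annotations keeps errors that cluster in the same record together, which tends to give a more cautious interval.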

True positive   False positive   False negative   Recall   Precision   F-measure   Confidence interval

The following table gives results for each category. Results for sparsely represented categories are, however, hard to interpret.

Category        True positive   False positive   False negative   Recall   Precision   F-measure
Last name       186             20               19               0.907    0.903       0.905
First name      101             29               8                0.927    0.777       0.845
Zip code        8               0                0                1.000    1.000       1.000
Serial number   1               0                2                0.333    1.000       0.500
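The recall, precision, and F-measure columns follow from the three counts by the standard definitions. The short Python sketch below recomputes them for the rows of the per-category table; the function name `scores` is only illustrative.

```python
def scores(tp, fp, fn):
    """Return (recall, precision, F-measure) from true/false positives and false negatives."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = recall + precision
    # F-measure is the harmonic mean of recall and precision.
    f = 2 * recall * precision / total if total else 0.0
    return recall, precision, f


# Counts taken from the per-category table.
ROWS = {
    "Last name": (186, 20, 19),
    "First name": (101, 29, 8),
    "Zip code": (8, 0, 0),
    "Serial number": (1, 0, 2),
}

for category, counts in ROWS.items():
    r, p, f = scores(*counts)
    print(f"{category}: recall={r:.3f} precision={p:.3f} F={f:.3f}")
```

The serial number row shows why sparse categories are hard to interpret: with only three occurrences in the corpus, a single additional hit or miss moves recall by a third.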



 Please cite:

 Further studies:

Last modified: Fri Oct 6 17:40:57 CEST 2017