MEDINA – Medical Information Anonymization

 



 

 Presentation

MEDINA (Medical Information Anonymization) is a Natural Language Processing (NLP) tool designed to de-identify personal data in French clinical records supplied as raw text files. The tool was developed at LIMSI-CNRS (UPR3251) within the Akenaton project, Automated Knowledge Extraction from medical records iN Association with a Telecardiology Observation Network (ANR-07-TecSan-001).

The tool relies on rules (syntactic patterns implemented as regular expressions) and gazetteers (lists of last names, first names, cities, etc.). The currently available version includes freely usable gazetteers (taken from the abu.cnam.fr/DICO/ website or built specifically for this tool) and a set of rules defined to process a corpus of 27,900 cardiology clinical records.
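To make the combination of rules and gazetteers concrete, here is a minimal Python sketch. It is not MEDINA's Perl code: the word lists, the two regular expressions, the tag names and the `deidentify` function are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy gazetteers; real lists would be far larger.
FIRST_NAMES = {"marie", "jean", "pierre"}
CITIES = {"paris", "lyon", "orsay"}

# Hypothetical rules: a date such as 12/03/2009 and a five-digit zip code.
DATE_RULE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
ZIP_RULE = re.compile(r"\b\d{5}\b")

# Runs of letters (accented letters included), used for gazetteer lookup.
WORD = re.compile(r"[^\W\d_]+")


def deidentify(text: str) -> str:
    """Replace gazetteer hits, then rule matches, with category tags."""

    def lookup(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        if key in FIRST_NAMES:
            return "<FIRST_NAME>"
        if key in CITIES:
            return "<CITY>"
        return word  # not in any list: left unchanged

    # Gazetteer pass first: the tags it inserts contain no digits,
    # so the numeric rules below cannot match inside them.
    text = WORD.sub(lookup, text)
    text = DATE_RULE.sub("<DATE>", text)
    text = ZIP_RULE.sub("<ZIP>", text)
    return text


print(deidentify("Mme Marie Dupont, vue le 12/03/2009 à Orsay (91400)."))
```

In this sketch "Dupont" is left untouched because it is absent from the toy lists, which illustrates why gazetteer coverage matters and why a manual check of the output remains necessary.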

In its current version, MEDINA handles information from the following categories (a configuration file lets you specify which categories to process):

Keep in mind that this tool only assists de-identification: some personal data may remain unprocessed at the end of the run (in our latest experiments, 83% of personal data were processed). A human curator should check the results and complete the missing de-identifications.

Contact:

 

 Usage

MEDINA consists of a few Perl scripts and runs from the command line (tested on Mac OS X and Unix). Processing follows two main steps:

 

 Documentation (in French)

 

 Download

MEDINA is freely available once a licence has been signed. First, contact us to obtain the licence form to fill out. Second, we ask our partnership service to validate the licence. Third, you sign it.

 

 Example

Modifications made at each step are shown in red.

 

 Evaluation

The following tables show results achieved by MEDINA on a corpus of 62 cardiology clinical records.

The first table gives overall results. The confidence interval was computed on the F-measure using a Monte Carlo simulation; it estimates the results that would be achieved on a corpus of ten million clinical records (assuming a similar distribution of properties at large scale).

True positive   False positive   False negative   Recall   Precision   F-measure   Confidence interval
548             87               110              0.8328   0.8630      0.8476      [0.8266; 0.8687]
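The overall scores follow directly from the raw counts (548 true positives, 87 false positives, 110 false negatives). The Python sketch below recomputes them and then derives an interval on the F-measure with a plain bootstrap. The bootstrap is an assumption made for illustration: the page does not describe the actual Monte Carlo procedure, so the interval obtained here need not match the reported one. The names `prf` and `bootstrap_interval` are hypothetical.

```python
import random


def prf(tp: int, fp: int, fn: int):
    """Return (recall, precision, F-measure) from raw counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_measure


def bootstrap_interval(tp, fp, fn, draws=1000, level=0.95, seed=0):
    """Percentile interval on the F-measure by resampling the outcomes.

    Assumption: a simple bootstrap, not necessarily MEDINA's simulation.
    """
    rng = random.Random(seed)
    n = tp + fp + fn
    scores = []
    for _ in range(draws):
        # Redraw n outcomes with the observed proportions.
        sample = rng.choices(("tp", "fp", "fn"), weights=(tp, fp, fn), k=n)
        counts = (sample.count("tp"), sample.count("fp"), sample.count("fn"))
        scores.append(prf(*counts)[2])
    scores.sort()
    cut = int(draws * (1 - level) / 2)  # number of scores trimmed on each side
    return scores[cut], scores[-cut - 1]


recall, precision, f_measure = prf(548, 87, 110)
print(f"{recall:.4f} {precision:.4f} {f_measure:.4f}")  # 0.8328 0.8630 0.8476

low, high = bootstrap_interval(548, 87, 110)
print(f"[{low:.4f}; {high:.4f}]")
```

The F-measure used here is the balanced harmonic mean of precision and recall, which reproduces the reported value of 0.8476.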

The following table gives results for each category. Results for sparsely represented categories should be interpreted with caution, since they rest on very few instances.

Category        True positive   False positive   False negative   Recall   Precision   F-measure
Date            213             13               29               0.880    0.942       0.910
Last name       186             20               19               0.907    0.903       0.905
First name      101             29               8                0.927    0.777       0.845
Hospital        16              16               27               0.372    0.500       0.427
City            11              5                11               0.500    0.688       0.579
Zip code        8               0                0                1.000    1.000       1.000
Address         1               2                7                0.125    0.333       0.182
Phone           8               0                0                1.000    1.000       1.000
Device          3               2                7                0.300    0.600       0.400
Serial number   1               0                2                0.333    1.000       0.500
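The caution about sparsely represented categories can be quantified with a small Python sketch. It uses the counts from the table above and shows, for each category, how much recall would move if a single missed item had been found instead. The helper name `recall_shift` is hypothetical.

```python
# (true positives, false positives, false negatives) per category
COUNTS = {
    "Date": (213, 13, 29),
    "Last name": (186, 20, 19),
    "First name": (101, 29, 8),
    "Hospital": (16, 16, 27),
    "City": (11, 5, 11),
    "Zip code": (8, 0, 0),
    "Address": (1, 2, 7),
    "Phone": (8, 0, 0),
    "Device": (3, 2, 7),
    "Serial number": (1, 0, 2),
}


def recall(tp: int, fn: int) -> float:
    """Share of the expected items that were found."""
    return tp / (tp + fn)


def recall_shift(tp: int, fn: int) -> float:
    """Change in recall if one missed item had been found instead."""
    if fn == 0:
        return 0.0  # nothing was missed, recall cannot rise
    return recall(tp + 1, fn - 1) - recall(tp, fn)


for name, (tp, fp, fn) in COUNTS.items():
    support = tp + fn  # number of expected items in the reference
    print(f"{name:<14} support={support:>3}  shift={recall_shift(tp, fn):.3f}")

# Column sums give the overall counts: (548, 87, 110).
totals = tuple(sum(column) for column in zip(*COUNTS.values()))
print(totals)
```

With only 8 expected addresses, one extra hit moves recall by 0.125, whereas the same change for dates (242 expected items) moves it by less than 0.005.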

 

 References

 Please cite:

 Further studies:


Last modified: Fri Oct 6 17:40:57 CEST 2017 http://medina.limsi.fr/