Large-scale matching algorithm for linking biomedical data warehouse records with the national mortality database in France (Preprint)

Published: 2022-01-21

Author:

Guardiolle Vianney, Bazoge Adrien, Morin Emmanuel, Daille Béatrice, Toublant Delphine, Bouzillé Guillaume, Merel Youenn, Pierre-Jean Morgane, Filiot Alexandre, Cuggia Marc, Wargny Matthieu, Lamer Antoine, Gourraud Pierre-Antoine

Abstract

BACKGROUND

Vital status after discharge is often missing or uncertain in biomedical data warehouses (BDWs), yet it is central to the value of BDWs for medical research. The French national mortality database (FNMD) offers open-source nominative records of every death. Matching large-scale BDW records with the FNMD combines multiple challenges: the absence of a unique common identifier between the two databases, names that change over a lifetime, clerical errors, and the rapidly growing number of pairwise comparisons to compute.
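To give a sense of the comparison problem, the sketch below counts the pairs a naive all-against-all comparison would need and shows blocking on an exact key, one common way to reduce that number. The abstract does not say whether the authors' algorithm uses blocking; the `birth_date` field, the record layout, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def naive_pair_count(n_bdw: int, n_fnmd: int) -> int:
    """Number of comparisons if every BDW record is compared with every FNMD record."""
    return n_bdw * n_fnmd


def blocked_pairs(bdw_records, fnmd_records, key="birth_date"):
    """Yield only the record pairs that share the same blocking key.

    Assumption: records are dicts and `key` is present in both sources.
    """
    # Index the (larger) mortality table once by the blocking key.
    index = defaultdict(list)
    for rec in fnmd_records:
        index[rec[key]].append(rec)
    # Each BDW record is compared only with candidates in its own block.
    for rec in bdw_records:
        for candidate in index.get(rec[key], []):
            yield rec, candidate
```

With one million BDW records and 25 million death records (hypothetical sizes), `naive_pair_count` returns 25 trillion comparisons, which is why some form of candidate reduction is usually needed before any string distance is computed.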

OBJECTIVE

We aimed to develop a new algorithm for matching BDW records to the FNMD and to evaluate its performance.

METHODS

We developed a deterministic algorithm based (i) on advanced data cleaning and knowledge of the naming system and (ii) on the Damerau-Levenshtein distance (DLD). The algorithm's performance was assessed independently on BDW data from three university hospitals: Lille, Nantes, and Rennes. Specificity was evaluated on subjects known to be alive on 1 January 2016, i.e., subjects with at least one hospital encounter both before and after this date. Sensitivity was evaluated on subjects recorded as deceased between 1 January 2001 and 31 December 2020. The DLD-based algorithm was compared with a direct matching algorithm with minimal data cleaning, which served as the reference.
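The following is a minimal sketch of name comparison with a Damerau-Levenshtein distance, written in Python rather than the authors' R. It implements the restricted (optimal string alignment) variant, which counts an adjacent transposition as a single edit. The cleaning rules in `normalize_name` and the `max_distance` threshold are assumptions for illustration; the abstract does not specify the actual cleaning steps or cutoff.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Uppercase, strip accents, and keep only letters (hypothetical cleaning rules)."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in no_accents.upper() if ch.isalpha())


def dld(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "TI" <-> "IT"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def is_match(bdw_name: str, fnmd_name: str, max_distance: int = 1) -> bool:
    """Deterministic rule: names match if their cleaned forms are within max_distance edits."""
    return dld(normalize_name(bdw_name), normalize_name(fnmd_name)) <= max_distance
```

Under these assumptions, `is_match("Hélène", "HELENE")` holds because the cleaned forms are identical, and a clerical swap such as "MARTIN" versus "MARITN" counts as a single edit, whereas a plain Levenshtein distance would count it as two.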

RESULTS

Across all centers combined, sensitivity was approximately 11 percentage points higher for the DLD-based algorithm (93.3%, 95% confidence interval [92.8-93.9]) than for the direct algorithm (82.7% [81.8-83.6], P<.001). Sensitivity was higher for men in two centers (Nantes: 87% [85.1-89] vs 83.6% [81.4-85.8], P=.006) and for subjects born in France in all centers (Nantes: 85.8% [84.3-87.3] vs 74.6% [72.8-76.4], P<.001). The sensitivity of the DLD-based algorithm differed significantly between centers (85.3% for Nantes vs 97.3% for Lille and Rennes, P<.001). Specificity was higher than 98% in all subgroups. Our algorithm was able to match tens of millions of death records from BDW, with parallel computing capabilities and low RAM requirements. The open-source R script is available at https://gitlab.com/ricdc/insee-deces.
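For readers who want to reproduce this kind of evaluation, the sketch below computes sensitivity and specificity with a 95% confidence interval from confusion counts. It uses the normal (Wald) approximation, which is an assumption: the abstract does not state which interval method the authors used. The function names are hypothetical, and the counts in the usage note are invented for illustration and are not the study's data.

```python
import math


def proportion_with_ci(successes: int, total: int, z: float = 1.96):
    """Return (estimate, lower, upper) using the normal (Wald) approximation."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp the bounds to the valid range for a proportion.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def sensitivity(true_pos: int, false_neg: int):
    """Share of truly deceased subjects that the linkage matched to a death record."""
    return proportion_with_ci(true_pos, true_pos + false_neg)


def specificity(true_neg: int, false_pos: int):
    """Share of subjects known to be alive that the linkage left unmatched."""
    return proportion_with_ci(true_neg, true_neg + false_pos)
```

For example, with 900 deceased subjects correctly matched and 100 missed (made-up counts), `sensitivity(900, 100)` gives an estimate of 0.90 with an interval of roughly 0.881 to 0.919.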

CONCLUSIONS

Overall, sensitivity (recall) was approximately 11 percentage points higher with the DLD-based algorithm than with the direct algorithm. This shows the importance of advanced data cleaning and knowledge of the naming system, combined with the use of DLD. Statistically significant differences in sensitivity between groups were found and must be considered when performing an analysis, to avoid differential bias. Our algorithm, originally conceived for linking a BDW with the FNMD, can be used for matching any large-scale databases. Although matching operations that use names are considered sensitive computational operations, the Inseehop package released here is easy to run on premises, which facilitates compliance with local cybersecurity frameworks. The use of an advanced deterministic matching algorithm such as the DLD-based algorithm is an instructive example of combining BDWs with open-source external data to improve their usage value.

Publisher

JMIR Publications Inc.
