Linking Biomedical Data Warehouse Records With the National Mortality Database in France: Large-scale Matching Algorithm

Published: 2022-11-01 Issue: 11 Volume: 10 Page: e36711
ISSN: 2291-9694
Container-title: JMIR Medical Informatics
Language: en
Short-container-title: JMIR Med Inform

Author:

Guardiolle Vianney, Bazoge Adrien, Morin Emmanuel, Daille Béatrice, Toublant Delphine, Bouzillé Guillaume, Merel Youenn, Pierre-Jean Morgane, Filiot Alexandre, Cuggia Marc, Wargny Matthieu, Lamer Antoine, Gourraud Pierre-Antoine

Abstract

Background: Vital status after discharge is often missing from or uncertain in a biomedical data warehouse (BDW), yet it is central to the value of a BDW in medical research. The French National Mortality Database (FNMD) offers open-source nominative records of every death. Matching large-scale BDW records with the FNMD combines multiple challenges: the absence of unique common identifiers between the 2 databases, names that change over a lifetime, clerical errors, and the exponential growth of the number of comparisons to compute.

Objective: We aimed to develop a new algorithm for matching BDW records to the FNMD and to evaluate its performance.

Methods: We developed a deterministic algorithm based on advanced data cleaning, knowledge of the naming system, and the Damerau-Levenshtein distance (DLD). The algorithm’s performance was independently assessed using BDW data from 3 university hospitals: Lille, Nantes, and Rennes. Specificity was evaluated with patients known to be alive on January 1, 2016 (ie, patients with at least 1 hospital encounter before and after this date). Sensitivity was evaluated with patients recorded as deceased between January 1, 2001, and December 31, 2020. The DLD-based algorithm was compared to a direct matching algorithm with minimal data cleaning as a reference.

Results: All centers combined, sensitivity was about 11 percentage points higher for the DLD-based algorithm (93.3%, 95% CI 92.8-93.9) than for the direct algorithm (82.7%, 95% CI 81.8-83.6; P<.001). Sensitivity was superior for men at 2 centers (Nantes: 87%, 95% CI 85.1-89 vs 83.6%, 95% CI 81.4-85.8; P=.006; Rennes: 98.6%, 95% CI 98.1-99.2 vs 96%, 95% CI 94.9-97.1; P<.001) and for patients born in France at all centers (Nantes: 85.8%, 95% CI 84.3-87.3 vs 74.9%, 95% CI 72.8-77.0; P<.001). The DLD-based algorithm revealed significant differences in sensitivity among centers (Nantes, 85.3% vs Lille and Rennes, 97.3%, P<.001). Specificity was >98% in all subgroups. Our algorithm matched tens of millions of death records from BDWs, with parallel computing capabilities and low RAM requirements. We used the Inseehop open-source R script for this measurement.

Conclusions: Overall, sensitivity (recall) was about 11 percentage points higher with the DLD-based algorithm than with the direct algorithm. This shows the importance of advanced data cleaning and of knowledge of the naming system, applied through the DLD. Statistically significant differences in sensitivity between groups were found and must be considered when performing an analysis to avoid differential biases. Our algorithm, originally conceived for linking a BDW with the FNMD, can be used to match any large-scale databases. While matching operations using names are considered sensitive computational operations, the Inseehop package released here is easy to run on premises, which facilitates compliance with local cybersecurity frameworks. The use of an advanced deterministic matching algorithm such as the DLD-based algorithm is an insightful example of combining open-source external data to improve the usage value of BDWs.
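To make the matching idea in the Methods concrete, the following is a minimal Python sketch of name cleaning followed by an edit-distance comparison. It is not the Inseehop implementation, whose details are not given in this abstract. The function names, the record fields (`last_name`, `first_name`, `birth_date`), the blocking on an identical birth date, and the `max_distance` threshold are all assumptions made for illustration. The distance shown is the restricted Damerau-Levenshtein variant (optimal string alignment), which counts an adjacent transposition as a single edit.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Hypothetical cleaning step: strip accents, uppercase, keep letters only."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in no_accents.upper() if ch.isalpha())


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "RE" <-> "ER"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def find_matches(patient: dict, deaths: list, max_distance: int = 1) -> list:
    """Return (distance, record) pairs for candidate death records.

    Assumed rule, for illustration only: a candidate must share the exact
    birth date, and the summed name distance must not exceed max_distance.
    """
    p_last = normalize_name(patient["last_name"])
    p_first = normalize_name(patient["first_name"])
    candidates = []
    for record in deaths:
        if record["birth_date"] != patient["birth_date"]:
            continue  # blocking on birth date limits the number of comparisons
        distance = (
            osa_distance(p_last, normalize_name(record["last_name"]))
            + osa_distance(p_first, normalize_name(record["first_name"]))
        )
        if distance <= max_distance:
            candidates.append((distance, record))
    candidates.sort(key=lambda pair: pair[0])  # closest candidates first
    return candidates
```

Comparing only records that share a blocking key such as the birth date is one common way to keep the number of pairwise comparisons manageable at this scale. Whether the published algorithm blocks this way is not stated here.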

Publisher

JMIR Publications Inc.

Subject

Health Information Management, Health Informatics

Cited by 5 articles.

1. Ensuring GDPR Compliance and Security in a Clinical Data Warehouse: Challenges and Insights from a University Hospital (Preprint);2024-07-01

2. Implementing a Biomedical Data Warehouse From Blueprint to Bedside in a Regional French University Hospital Setting: Unveiling Processes, Overcoming Challenges, and Extracting Clinical Insight;JMIR Medical Informatics;2024-06-24

3. Implementing a Biomedical Data Warehouse From Blueprint to Bedside in a Regional French University Hospital Setting: Unveiling Processes, Overcoming Challenges, and Extracting Clinical Insight;JMIR MED INF;2024

4. Phenotyping of heart failure with preserved ejection faction using electronic health records and echocardiography;European Heart Journal Open;2023-12-14

5. Digital health and care: emerging from pandemic times;BMJ Health & Care Informatics;2023-10