Large-scale matching algorithm for linking biomedical data warehouse records with the national mortality database in France (Preprint)

Authors:

Guardiolle Vianney, Bazoge Adrien, Morin Emmanuel, Daille Béatrice, Toublant Delphine, Bouzillé Guillaume, Merel Youenn, Pierre-Jean Morgane, Filiot Alexandre, Cuggia Marc, Wargny Matthieu, Lamer Antoine, Gourraud Pierre-Antoine

Abstract

BACKGROUND

Vital status after discharge is often missing or uncertain in biomedical data warehouses (BDWs), yet it is central to their value for medical research. The French national mortality database (FNMD) offers open-source nominative records of every death. Matching large-scale BDW records with the FNMD combines multiple challenges: the absence of a unique common identifier between the two databases, names that change over a lifetime, clerical errors, and the exponential growth of the number of comparisons to compute.

OBJECTIVE

We aimed to develop a new algorithm for matching BDW records to the FNMD and to evaluate its performance.

METHODS

We developed a deterministic algorithm based (i) on advanced data cleaning and knowledge of the naming system and (ii) on the Damerau-Levenshtein distance (DLD). The algorithm's performance was assessed independently in the BDWs of three university hospitals: Lille, Nantes, and Rennes. Specificity was evaluated in subjects known to be alive on 1 January 2016, i.e., subjects with at least one hospital encounter both before and after this date. Sensitivity was evaluated in subjects recorded as deceased between 1 January 2001 and 31 December 2020. The DLD-based algorithm was compared with a reference direct matching algorithm that used minimal data cleaning.
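To illustrate the kind of comparison a DLD-based matching step relies on, the following is a minimal Python sketch, not the authors' R implementation. It computes the optimal string alignment distance, a common restricted variant of the Damerau-Levenshtein distance, after a simple name normalization. The function names, the normalization rules, and the `max_distance=1` threshold are all assumptions made for illustration; the actual cleaning rules and thresholds are those of the published script.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Illustrative cleaning (assumed rules): strip accents, uppercase, keep letters only."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in no_accents.upper() if c.isalpha())


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and transpositions of two
    adjacent characters, each with a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def names_match(a: str, b: str, max_distance: int = 1) -> bool:
    """Hypothetical decision rule: names match if the cleaned strings are close enough."""
    return osa_distance(normalize_name(a), normalize_name(b)) <= max_distance
```

With this sketch, `names_match("Lefèvre", "LEFEVRE")` holds because both names normalize to the same string, and a swap of two adjacent letters such as "MARTIN" and "MARTNI" counts as a single edit rather than two.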

RESULTS

Across all centers, sensitivity was approximately 11 percentage points higher for the DLD-based algorithm (93.3%, 95% confidence interval [CI]: 92.8-93.9) than for the direct algorithm (82.7%, 95% CI: 81.8-83.6; P<.001). Sensitivity was higher for men in two centers (Nantes: 87% [85.1-89] vs 83.6% [81.4-85.8], P=.006) and for subjects born in France in all centers (Nantes: 85.8% [84.3-87.3] vs 74.6% [72.8-76.4], P<.001). The sensitivity of the DLD-based algorithm differed significantly between centers (85.3% for Nantes vs 97.3% for Lille and Rennes, P<.001). Specificity was higher than 98% in all subgroups. Our algorithm was able to match tens of millions of death records from BDWs, with parallel computing capabilities and low RAM requirements. The open-source R script is available at https://gitlab.com/ricdc/insee-deces.
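As a reading aid for the figures above, the sketch below shows one way a sensitivity or specificity estimate with a 95% confidence interval can be computed from match counts. It assumes a normal-approximation (Wald) interval; the abstract does not state which interval method the authors used, and the function names are hypothetical. Here a true positive is a subject recorded as deceased who is matched to a death record, and a true negative is a subject known to be alive who is not matched.

```python
import math


def proportion_with_ci(successes: int, total: int, z: float = 1.96):
    """Return (estimate, lower, upper) using a Wald interval (assumed method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp the bounds to the valid range for a proportion
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def sensitivity(true_pos: int, false_neg: int):
    """Share of deceased subjects that were matched to a death record."""
    return proportion_with_ci(true_pos, true_pos + false_neg)


def specificity(true_neg: int, false_pos: int):
    """Share of subjects known to be alive that were not matched."""
    return proportion_with_ci(true_neg, true_neg + false_pos)
```

For example, with purely hypothetical counts of 90 matched and 10 unmatched deceased subjects, `sensitivity(90, 10)` gives an estimate of 0.90 with an interval of roughly 0.84 to 0.96.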

CONCLUSIONS

Overall, sensitivity (recall) was approximately 11 percentage points higher with the DLD-based algorithm than with the direct algorithm. This shows the importance of advanced data cleaning and of knowledge of the naming system, applied through the DLD. Statistically significant differences in sensitivity were found between groups and must be taken into account in analyses to avoid differential bias. Our algorithm, originally conceived for linking a BDW with the FNMD, can be used for matching any large-scale databases. Because matching operations that use names are considered sensitive computational operations, it is an advantage that the Inseehop package released here is easy to run on premises, which facilitates compliance with local cybersecurity frameworks. The use of an advanced deterministic matching algorithm such as the DLD-based algorithm is an instructive example of how combining BDWs with open-source external data can improve their usage value.

Publisher

JMIR Publications Inc.
