Medical record linkage in health information systems by approximate string matching and clustering

Published:2005-10-11 Issue:1 Volume:5 Page:
ISSN:1472-6947
Container-title:BMC Medical Informatics and Decision Making
language:en
Short-container-title:BMC Med Inform Decis Mak

Author:

Sauleau Erik A, Paumier Jean-Philippe, Buemi Antoine

Abstract

Background: Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making by computing a value reflecting identity proximity.

Methods: The proposed method has three steps. The first is to standardise and index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match pairs of similar records, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. The third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods.

Results: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records.

Conclusion: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e. real-time) proximity detection when inserting a new identity.
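The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration only: the record layout (surname, first name, birth year), the blocking key (first letter of the surname plus birth year), the 0.9 threshold, and the averaging of two field scores into one global value. It uses the standard Jaro-Winkler similarity rather than the "Porter-Jaro-Winkler" variant mentioned above, and it replaces the agglomerative and partitioning methods with plain connected components over the match graph.

```python
from collections import defaultdict
from itertools import combinations


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)


def link_records(records, threshold=0.9):
    """Cluster (surname, first_name, birth_year) tuples; returns index lists."""
    # Step 1: standardise fields and index them under a blocking key
    # (hypothetical key: surname initial + birth year).
    clean = [(s.strip().upper(), f.strip().upper(), y) for s, f, y in records]
    blocks = defaultdict(list)
    for idx, (surname, _, year) in enumerate(clean):
        blocks[(surname[:1], year)].append(idx)

    # Union-find structure used to merge matched records into clusters.
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 2: compare pairs inside each block only.
    for members in blocks.values():
        for i, j in combinations(members, 2):
            score = (jaro_winkler(clean[i][0], clean[j][0])
                     + jaro_winkler(clean[i][1], clean[j][1])) / 2
            if score >= threshold:
                parent[find(i)] = find(j)

    # Step 3: connected components of the match graph form the clusters.
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

Blocking keeps the number of comparisons manageable because only records sharing a key are compared; the trade-off is that a typo in the blocking field itself (here, the surname initial or birth year) hides a true duplicate, which is why real systems often run several blocking passes with different keys.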

Publisher

Springer Science and Business Media LLC

Subject

Health Informatics, Health Policy, Computer Science Applications

Link

http://link.springer.com/content/pdf/10.1186/1472-6947-5-32.pdf

References: 39 articles.

1. Belin TR, Rubin DB: A method for calibrating false match rates in record linkage. Journal of the American Statistical Association. 1995, 90: 697-707.

2. Newcombe HB, Kennedy JM: Record linkage: making maximum use of the discriminating power of identifying information. Communications of the ACM. 1962, 5: 563-566. 10.1145/368996.369026.

3. Vintsyuk T: Speech discrimination by dynamic programming. Cybernetics. 1968, 4: 52-58. 10.1007/BF01074755.

4. Sellers P: The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms. 1980, 1: 359-373. 10.1016/0196-6774(80)90016-4.

5. Navarro G, Raffinot M: Flexible pattern matching in strings. 2002, Cambridge, Cambridge University Press

Cited by 30 articles.

1. Gecko: A Python library for the generation and mutation of realistic personal identification data at scale;SoftwareX;2024-09

2. Clustering Heterogeneous Data Values for Data Quality Analysis;Journal of Data and Information Quality;2023-08-22

3. Understanding the Digital Resilience of Physicians during the COVID-19 Pandemic: An Empirical Study;MIS Quarterly;2023-03-01

4. FIRLA: a Fast Incremental Record Linkage Algorithm;Journal of Biomedical Informatics;2022-06

5. Detecting Quality Problems in Data Models by Clustering Heterogeneous Data Values;2021 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C);2021-10