A comparative review of patient matching algorithm to deduplicate and uniquely identify HIV+ clients (Preprint)

Author:

Antoine Mayer, Kariuki James, de Kerorguen Nicolas, Ojo Adebowale, Yoon Steven, Manders Eric-Jan

Abstract

BACKGROUND

HIV programs in low- and middle-income countries (LMICs) have committed significant effort and resources to achieving the UNAIDS goal of ending the HIV epidemic by 2030. These programs mainly use aggregate programmatic data and periodic surveys to describe the HIV care continuum. WHO has recommended a shift to person-centered monitoring, which will require HIV programs to link and deduplicate client data from multiple sources. Matching on patient demographic data is a practical way for HIV programs to link and deduplicate records across data systems and uniquely identify HIV patients.

OBJECTIVE

This study compares seven patient matching algorithms, ranging from traditional deterministic and probabilistic techniques to fundamental machine learning algorithms; discusses the advantages and challenges of each approach; and provides practical considerations for implementation in public health settings to support HIV surveillance in LMICs.

METHODS

We conducted an experimental evaluation in a hypothetical case surveillance context using synthetic demographic data modeled on real datasets. We generated three synthetic demographic datasets containing duplicates with known match status. For each matching approach, we selected prominent, commonly used algorithms: exact matching and a pseudo unique identifier; weighted average score-based matching and the Expectation-Maximization (EM) algorithm; and three supervised machine learning algorithms (naive Bayes, logistic regression, and support vector machine). We deduplicated each dataset with the selected algorithms and compared matching quality using the confusion matrix, recall, precision, F score, and the number of unique records identified. The experiment and algorithms were implemented using the Python Record Linkage Toolkit.
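To make the difference between the traditional approaches concrete, the following is a minimal sketch in plain Python, not the Python Record Linkage Toolkit API and not the study's implementation. The field names, weights, threshold, pseudo identifier recipe, and sample records are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical field weights and decision threshold (illustrative only).
WEIGHTS = {"first_name": 0.3, "last_name": 0.3, "birth_date": 0.3, "sex": 0.1}
THRESHOLD = 0.85


def exact_match(a, b):
    """Deterministic rule: every compared field must be identical."""
    return all(a[field] == b[field] for field in WEIGHTS)


def pseudo_id(rec):
    """Hypothetical pseudo unique identifier built from demographic fragments."""
    return (rec["first_name"][:2] + rec["last_name"][:2]
            + rec["birth_date"] + rec["sex"]).upper()


def similarity(x, y):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def weighted_score(a, b):
    """Weighted average of per-field similarities (weights sum to 1)."""
    return sum(w * similarity(a[f], b[f]) for f, w in WEIGHTS.items())


def weighted_match(a, b):
    """Declare a match when the weighted score reaches the threshold."""
    return weighted_score(a, b) >= THRESHOLD


# Two made-up records that differ by one dropped letter in the last name.
rec_a = {"first_name": "Amina", "last_name": "Okafor",
         "birth_date": "1990-04-12", "sex": "F"}
rec_b = {"first_name": "Amina", "last_name": "Okafo",
         "birth_date": "1990-04-12", "sex": "F"}

print(exact_match(rec_a, rec_b))               # strict equality on all fields
print(pseudo_id(rec_a) == pseudo_id(rec_b))    # compares derived identifiers
print(round(weighted_score(rec_a, rec_b), 3))  # tolerant of small typos
```

In this sketch a single typo defeats exact matching, while the weighted score stays above the threshold. The pseudo identifier survives here only because the typo falls outside the characters it uses; an error in the first two letters of a name would break it.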

RESULTS

Across the three synthetic datasets, out-of-the-box supervised machine learning classifiers from the Python toolkit outperformed the traditional approaches: exact matching, pseudo unique identifier, weighted average, and even the EM algorithm. Support vector machine and logistic regression registered average F scores of 94.15% and 94.14%, respectively. Expectation-Maximization registered an average F score of 91.91%, making it a good candidate when a probabilistic approach is preferred. The exact matching and pseudo unique identifier techniques registered the lowest F scores and the highest numbers of false negatives.
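As a reminder of how these quality measures relate to the confusion matrix, here is a small sketch. It assumes the reported F score is the balanced F1 measure, which the abstract does not state explicitly, and the counts used are hypothetical rather than taken from the study.

```python
def matching_quality(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    tp: record pairs correctly classified as matches
    fp: non-matching pairs wrongly classified as matches
    fn: true matches that the algorithm missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts for one deduplication run (not results from the study).
precision, recall, f_score = matching_quality(tp=90, fp=5, fn=10)
print(f"precision={precision:.4f} recall={recall:.4f} F={f_score:.4f}")
```

Because recall falls as false negatives rise, an approach that misses many true duplicates, as exact matching tends to do, is penalized in the F score even when its precision is high.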

CONCLUSIONS

The supervised machine learning algorithms were the most effective in terms of matching quality but also the most complex and difficult to develop, maintain, and support in a resource-limited setting. Despite those challenges, an HIV program can use these algorithms to deduplicate and uniquely identify patients with high precision and sensitivity. We recommend that any implementation include standardization of demographic data, evaluation of multiple algorithms to select the best one for the dataset, and a clear plan to continuously monitor and improve matching quality over time.

Publisher

JMIR Publications Inc.
