BACKGROUND
HIV programs in low- and middle-income countries have committed significant effort and resources to achieving the UNAIDS goal of ending the HIV epidemic by 2030. These programs rely mainly on aggregate programmatic data and periodic surveys to describe the HIV care continuum. WHO has recommended a shift to person-centered monitoring, which will require HIV programs to link and deduplicate clients’ data from multiple sources. Matching on patient demographic data is a practical solution that can help HIV programs link and deduplicate records across different data systems and uniquely identify HIV patients.
OBJECTIVE
This study aims to compare seven patient matching algorithms, ranging from traditional deterministic and probabilistic techniques to fundamental machine learning algorithms; to discuss the advantages and challenges of each approach; and to provide practical considerations for implementation in public health settings to support HIV surveillance in low- and middle-income countries.
METHODS
We conducted an experimental evaluation in a hypothetical case surveillance context using synthetic demographic data modeled on real datasets. We generated three synthetic demographic datasets containing duplicate records with known match status. For each matching approach, we selected prominent and commonly used algorithms: exact matching and a pseudo unique identifier; weighted average score-based matching and the Expectation-Maximization (EM) algorithm; and three supervised machine learning algorithms (naive Bayes, logistic regression, and support vector machine). We deduplicated each dataset with the selected algorithms and compared matching quality using the confusion matrix, recall, precision, F score, and the number of unique records identified. The experiments and algorithms were implemented using the Python Record Linkage Toolkit.
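To make the evaluation procedure concrete, the following is a minimal sketch of how exact matching and weighted average score-based matching can be compared against a known match status using precision, recall, and F score. It is an illustration only and is not the study's implementation: it uses the Python standard library rather than the Python Record Linkage Toolkit, and the toy records, field names, field weights, and threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy records invented for illustration; they are not data from the study.
records = {
    1: {"first": "amina", "last": "okafor", "dob": "1990-04-12", "sex": "F"},
    2: {"first": "amina", "last": "okafor", "dob": "1990-04-12", "sex": "F"},  # exact duplicate of 1
    3: {"first": "amyna", "last": "okafor", "dob": "1990-04-12", "sex": "F"},  # duplicate of 1 with a typo
    4: {"first": "john", "last": "banda", "dob": "1985-11-02", "sex": "M"},
}
# Known match status (ground truth) for the toy data.
true_matches = {(1, 2), (1, 3), (2, 3)}

# Hypothetical field weights (summing to 1) and decision threshold.
WEIGHTS = {"first": 0.3, "last": 0.3, "dob": 0.3, "sex": 0.1}
THRESHOLD = 0.85


def exact_match(a, b):
    """Deterministic rule: every compared field must be identical."""
    return all(a[field] == b[field] for field in WEIGHTS)


def weighted_score(a, b):
    """Weighted average of per-field string similarities in [0, 1]."""
    return sum(
        weight * SequenceMatcher(None, a[field], b[field]).ratio()
        for field, weight in WEIGHTS.items()
    )


def evaluate(predicted, truth):
    """Confusion-matrix counts over record pairs, plus precision, recall, F score."""
    tp = len(predicted & truth)   # pairs correctly declared matches
    fp = len(predicted - truth)   # pairs wrongly declared matches
    fn = len(truth - predicted)   # true matches that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f_score": f_score}


# Compare every pair of records (full comparison; no blocking in this sketch).
pairs = list(combinations(sorted(records), 2))
exact_pairs = {(i, j) for i, j in pairs if exact_match(records[i], records[j])}
weighted_pairs = {
    (i, j) for i, j in pairs
    if weighted_score(records[i], records[j]) >= THRESHOLD
}

exact_result = evaluate(exact_pairs, true_matches)
weighted_result = evaluate(weighted_pairs, true_matches)
print("exact matching:  ", exact_result)
print("weighted average:", weighted_result)
```

On this toy data, exact matching misses the duplicate that contains a typo, which appears as false negatives and lowers recall, while the similarity-based score tolerates it. Real datasets would normally also need a blocking step to avoid comparing every pair of records.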
RESULTS
Across the three synthetic datasets, out-of-the-box supervised machine learning classifiers from the Python toolkit outperformed the traditional approaches: exact matching, pseudo unique identifier, weighted average, and even the EM algorithm. Support vector machine and logistic regression registered average F scores of 94.15% and 94.14%, respectively. The EM algorithm nonetheless registered an average F score of 91.91%, making it a good candidate when a probabilistic approach is considered. The exact matching and pseudo unique identifier techniques registered the lowest F scores and the highest numbers of false negatives.
CONCLUSIONS
The supervised machine learning algorithms were the most effective in terms of matching quality, but also the most complex and difficult to develop, maintain, and support in a resource-limited setting. Despite those challenges, an HIV program can use these algorithms to deduplicate and uniquely identify patients with high precision and sensitivity. We recommend that any implementation include standardization of demographic data, evaluation of multiple algorithms with selection of the best one for the dataset at hand, and a clear plan to continuously monitor and improve matching quality over time.
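As an illustration of what standardization of demographic data might involve before matching, the sketch below normalizes names, dates of birth, and sex codes. It is a minimal example under stated assumptions, not the procedure used in the study: the function names, the accepted date formats, and the sex code mapping are hypothetical and would need to be adapted to local data conventions.

```python
import re
import unicodedata
from datetime import datetime

# Hypothetical list of date formats seen in source systems.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")
# Hypothetical mapping of raw sex values to standard codes.
SEX_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def standardize_name(raw):
    """Lowercase, strip accents, and collapse punctuation and whitespace.

    Assumption: names are written in Latin script; other scripts would
    need a different rule because non a-z characters are replaced here.
    """
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z]+", " ", no_accents.lower()).strip()


def standardize_date(raw):
    """Return the date in ISO format (YYYY-MM-DD), or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize_sex(raw):
    """Map free-text sex values to F, M, or U (unknown)."""
    return SEX_CODES.get(raw.strip().lower(), "U")


def standardize_record(record):
    """Apply all field-level rules to one demographic record."""
    return {
        "first": standardize_name(record["first"]),
        "last": standardize_name(record["last"]),
        "dob": standardize_date(record["dob"]),
        "sex": standardize_sex(record["sex"]),
    }


# Toy record invented for illustration.
example = standardize_record(
    {"first": "  José-María ", "last": "O'KAFOR", "dob": "12/04/1990", "sex": "Female"}
)
print(example)
```

Applying the same rules to every source system before matching reduces the spurious differences (case, accents, date formats) that otherwise cause exact matching to miss true duplicates.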