BACKGROUND
Machine learning models are increasingly being used in healthcare settings. If the integrity of the data used to build these models is compromised by cybersecurity attacks, the outputs of these predictive models become questionable.
OBJECTIVE
To assess the risks associated with false data injection into provider progress notes, and to evaluate the potential of exploiting the variance in the predictions of text mining methods to detect such data integrity issues.
METHODS
A simulation of false data injection scenarios was conducted on a set of provider notes. Common statistical text mining (STM) methods were used to assess the mental health severity of the patients described in the falsified notes. The simulation experiment focused on (1) assessing the overall classification stability across the different types of false data injection, (2) identifying the classification algorithms that are robust against these attacks, and (3) evaluating the potential of STM methods for signaling data integrity issues.
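As a concrete illustration of the setup, the sketch below simulates a template-injection attack against a single STM classifier. The toy corpus, the screening template text, and the model settings are illustrative placeholders, not the study's actual data or attack payloads.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Toy stand-in corpus; the study used 96 severe and 337 non-severe notes.
notes = [
    "patient reports active suicidal ideation and severe hopelessness",
    "acute psychosis with command hallucinations, hospitalization required",
    "severe manic episode, not sleeping, grandiose and agitated",
    "mild situational stress, coping well, no safety concerns",
    "routine follow-up, mood euthymic, sleeping and eating normally",
    "patient doing well on current regimen, denies depressive symptoms",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = severe, 0 = non-severe

# Hypothetical screening template standing in for the injected payload.
SCREENING_TEMPLATE = (
    "PHQ-9 screening administered. Patient denies suicidal ideation. "
    "Mood stable. Follow-up in 4 weeks."
)

def inject_template(note: str, copies: int = 1) -> str:
    """Falsify a note by appending one or more copies of a screening template."""
    return note + (" " + SCREENING_TEMPLATE) * copies

# Fit a single STM model on the clean notes, then score it on falsified copies.
vectorizer = TfidfVectorizer(stop_words="english")
X_clean = vectorizer.fit_transform(notes)
X_attacked = vectorizer.transform([inject_template(n) for n in notes])

svm = SVC(kernel="linear").fit(X_clean, labels)
print("clean accuracy:   ", accuracy_score(labels, svm.predict(X_clean)))
print("attacked accuracy:", accuracy_score(labels, svm.predict(X_attacked)))
```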
RESULTS
A simulation experiment using a training dataset of 96 severe and 337 non-severe psychiatric provider notes revealed that the performance of classification models drops under false data injection attacks. The accuracy of single models such as support vector machines and decision trees dropped significantly (an average drop of 16.41%) with the injection of a single screening template. Ensemble models such as bagging and boosting were robust against single-template injections, with an average accuracy drop of about 0.513%. The performance of all models dropped significantly when the injected false data exceeded 50% of the size of the note.
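The sketch below extends the one above (reusing notes, labels, vectorizer, X_clean, and X_attacked) to contrast single models with bagging and boosting ensembles under the same injection; the hyperparameters are assumptions, not the study's settings.

```python
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Single models alongside the ensemble methods named in the abstract.
models = {
    "svm": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    "boosting": GradientBoostingClassifier(random_state=0),
}

# Train each model on clean notes; measure how far accuracy falls
# when the same notes arrive with an injected template.
for name, model in models.items():
    model.fit(X_clean, labels)
    clean = accuracy_score(labels, model.predict(X_clean))
    attacked = accuracy_score(labels, model.predict(X_attacked))
    print(f"{name}: accuracy drop = {clean - attacked:.3f}")
```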
CONCLUSIONS
While STM methods can be useful for assessing the severity of the mental health conditions expressed in provider notes, the performance of such models can drop significantly under false data injection. Traditionally, such non-robust behavior of STM models is undesirable. Counterintuitively, we show here that this lack of robustness can be leveraged to generate signals of malicious false data injection into electronic health record (EHR) systems. Hence, the prediction variance of these models can potentially be used to signal data integrity issues.
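One possible operationalization of this variance-as-signal idea, continuing the sketches above: if independently trained models disagree sharply on a note, flag it for an integrity review. The disagreement measure and the threshold below are illustrative assumptions, not the study's detection rule.

```python
import numpy as np

def integrity_flags(fitted_models, X, threshold=0.2):
    """Flag samples where predictions vary noticeably across models."""
    preds = np.array([m.predict(X) for m in fitted_models.values()])
    variance = preds.var(axis=0)  # 0 when every model agrees
    return variance > threshold

# Notes whose falsified versions split the models are candidates for review.
print("flagged notes:", np.where(integrity_flags(models, X_attacked))[0])
```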