Author:
Ebrahimi Ali,Henriksen Margrethe Bang Høstgaard,Brasen Claus Lohman,Hilberg Ole,Hansen Torben Frøstrup,Jensen Lars Henrik,Peimankar Abdolrahman,Wiil Uffe Kock
Abstract
Abstract
Background
Smoking is a critical risk factor responsible for over eight million annual deaths worldwide. It is essential to obtain information on smoking habits to advance research and implement preventive measures such as screening of high-risk individuals. In most countries, including Denmark, smoking habits are not systematically recorded and at best documented within unstructured free-text segments of electronic health records (EHRs). This would require researchers and clinicians to manually navigate through extensive amounts of unstructured data, which is one of the main reasons that smoking habits are rarely integrated into larger studies. Our aim is to develop machine learning models to classify patients’ smoking status from their EHRs.
Methods
This study proposes an efficient natural language processing (NLP) pipeline capable of classifying patients’ smoking status and providing explanations for the decisions. The proposed NLP pipeline comprises four distinct components, which are; (1) considering preprocessing techniques to address abbreviations, punctuation, and other textual irregularities, (2) four cutting-edge feature extraction techniques, i.e. Embedding, BERT, Word2Vec, and Count Vectorizer, employed to extract the optimal features, (3) utilization of a Stacking-based Ensemble (SE) model and a Convolutional Long Short-Term Memory Neural Network (CNN-LSTM) for the identification of smoking status, and (4) application of a local interpretable model-agnostic explanation to explain the decisions rendered by the detection models. The EHRs of 23,132 patients with suspected lung cancer were collected from the Region of Southern Denmark during the period 1/1/2009-31/12/2018. A medical professional annotated the data into ‘Smoker’ and ‘Non-Smoker’ with further classifications as ‘Active-Smoker’, ‘Former-Smoker’, and ‘Never-Smoker’. Subsequently, the annotated dataset was used for the development of binary and multiclass classification models. An extensive comparison was conducted of the detection performance across various model architectures.
Results
The results of experimental validation confirm the consistency among the models. However, for binary classification, BERT method with CNN-LSTM architecture outperformed other models by achieving precision, recall, and F1-scores between 97% and 99% for both Never-Smokers and Active-Smokers. In multiclass classification, the Embedding technique with CNN-LSTM architecture yielded the most favorable results in class-specific evaluations, with equal performance measures of 97% for Never-Smoker and measures in the range of 86 to 89% for Active-Smoker and 91–92% for Never-Smoker.
Conclusion
Our proposed NLP pipeline achieved a high level of classification performance. In addition, we presented the explanation of the decision made by the best performing detection model. Future work will expand the model’s capabilities to analyze longer notes and a broader range of categories to maximize its utility in further research and screening applications.
Funder
University of Southern Denmark
Publisher
Springer Science and Business Media LLC
Reference40 articles.
1. de Boer AR et al. Data mining to retrieve smoking status from electronic health records in general practice. Eur Hear Journal-Digital Heal, 2022.
2. Roth G, Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2017 (GBD 2017) Results. Seattle, United States: Institute for Health Metrics and Evaluation (IHME), 2018, Lancet, vol. 392, pp. 1736–1788, 2018.
3. Malhotra J, Malvezzi M, Negri E, Vecchia CL, Boffetta P. Risk factors for lung cancer worldwide. Eur Respir J. 2016;48(3):889–902.
4. Lange P et al. Danish Register of chronic obstructive pulmonary disease. Clin Epidemiol, pp. 673–8, 2016.
5. Schmidt M, et al. The Danish health care system and epidemiological research: from health care contacts to database records. Clin Epidemiol. 2019;11:563.