BACKGROUND
Identifying individuals with depressive symptomatology (DS) promptly and effectively is essential for providing timely treatment. Machine learning models have shown promise in this area, yet studies often fail to demonstrate the practical benefits of these models or to provide tangible real-world applications.
OBJECTIVE
This study aimed to (1) establish a novel methodology for identifying individuals likely to exhibit DS; (2) identify the most influential features in a more explainable way, using probabilistic measures; and (3) propose tools that can be used in real-world applications.
METHODS
Three datasets were used: the PROACTIVE dataset and the Brazilian National Health Survey (PNS) datasets from 2013 and 2019, comprising socio-demographic and health-related features. A Bayesian network was used for feature selection. The selected features were then used to train machine learning models to predict DS, operationalized as a score of 10 or higher on the 9-item Patient Health Questionnaire (PHQ-9). In addition, an analysis evaluated how the target sensitivity affects the reduction in screening time achieved by using the model rather than a random approach.
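The outcome definition and the two reported metrics can be illustrated with a minimal sketch. It is not the study's code: the function names are hypothetical, the 0 to 3 item range is the standard PHQ-9 scoring rather than something stated here, and treating a probability equal to the threshold as a positive prediction is an assumption.

```python
from typing import Sequence

PHQ9_CUTOFF = 10  # from the text: DS means a PHQ-9 total of 10 or higher


def has_ds(phq9_items: Sequence[int]) -> bool:
    """Label one respondent from nine PHQ-9 item scores (each 0 to 3)."""
    if len(phq9_items) != 9 or any(not 0 <= s <= 3 for s in phq9_items):
        raise ValueError("expected nine item scores between 0 and 3")
    return sum(phq9_items) >= PHQ9_CUTOFF


def sensitivity(y_true: Sequence[int], y_prob: Sequence[float],
                threshold: float = 0.5) -> float:
    """Share of true DS cases whose predicted probability reaches the threshold."""
    # Assumption: a probability equal to the threshold counts as a positive.
    case_probs = [p for t, p in zip(y_true, y_prob) if t]
    if not case_probs:
        raise ValueError("no DS cases in y_true")
    return sum(p >= threshold for p in case_probs) / len(case_probs)


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random case outscores a random non-case (ties count half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t]
    neg = [p for t, p in zip(y_true, y_prob) if not t]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the sample size, which is acceptable for illustration; on survey-sized data a rank-based routine from an established library would be the more practical choice.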
RESULTS
With a classification threshold of 0.5, the methodology achieved sensitivities of 0.640, 0.659, and 0.694 and areas under the receiver operating characteristic curve (AUC) of 0.736, 0.801, and 0.809 for PROACTIVE, PNS 2013, and PNS 2019, respectively. For the PROACTIVE dataset, the most influential features were postural balance, shortness of breath, and how old people feel they are. For the PNS 2013 dataset, they were the ability to do usual activities, chest pain, sleep problems, and chronic back problems. The PNS 2019 dataset shared three of these features with PNS 2013, with verbal abuse replacing chronic back problems. Note that the features available in the PNS datasets differ from those in the PROACTIVE dataset. An empirical analysis showed that using the proposed model reduced screening time by up to 52% while maintaining a sensitivity of 0.80.
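One way to reason about the screening-time comparison is sketched below. It is an illustration under stated assumptions, not the study's procedure: screening time is assumed proportional to the number of people screened, people are screened in descending order of predicted risk, and the random baseline uses the expected position of the m-th case in a uniformly random ordering, m(N+1)/(P+1) for P cases among N people. All function names are hypothetical.

```python
import math
from typing import Sequence


def cases_needed(n_cases: int, target_sensitivity: float) -> int:
    """Number of DS cases that must be found to reach the target sensitivity."""
    if n_cases <= 0 or not 0 < target_sensitivity <= 1:
        raise ValueError("need at least one case and a target in (0, 1]")
    # Round first so floating-point noise does not push ceil one step too far.
    return math.ceil(round(target_sensitivity * n_cases, 9))


def screened_with_model(y_true: Sequence[int], y_prob: Sequence[float],
                        target_sensitivity: float) -> int:
    """People screened, highest predicted risk first, until enough cases are found."""
    needed = cases_needed(sum(y_true), target_sensitivity)
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    found = 0
    for screened, (_, label) in enumerate(ranked, start=1):
        found += label
        if found >= needed:
            return screened
    return len(ranked)


def expected_screened_at_random(n_people: int, n_cases: int,
                                target_sensitivity: float) -> float:
    """Expected people screened in random order until enough cases are found."""
    needed = cases_needed(n_cases, target_sensitivity)
    return needed * (n_people + 1) / (n_cases + 1)


def screening_reduction(y_true: Sequence[int], y_prob: Sequence[float],
                        target_sensitivity: float = 0.80) -> float:
    """Relative reduction in people screened, model order versus random order."""
    with_model = screened_with_model(y_true, y_prob, target_sensitivity)
    at_random = expected_screened_at_random(len(y_true), sum(y_true),
                                            target_sensitivity)
    return 1 - with_model / at_random
```

Calling `screening_reduction(y_true, y_prob, 0.80)` on held-out labels and predicted probabilities gives the fraction of screening effort saved at that sensitivity under these assumptions; a negative value would mean the model ordering performs worse than random.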
CONCLUSIONS
This study developed a novel methodology for identifying individuals with DS and demonstrated the practical benefits of using Bayesian networks to select the most significant features for a machine learning model that predicts DS, across three general health and socio-economic datasets. Moreover, simulations indicated that this approach has the potential to substantially reduce the time required to identify people with DS while maintaining high sensitivity. These findings pave the way for improved early identification and intervention strategies for individuals experiencing DS.