Abstract
Objective
Affective disorders are associated with atypical voice patterns; however, automated voice analyses suffer from small sample sizes and untested generalizability on external data. We investigated a generalizable approach to aid clinical evaluation of depression and remission from voice using transfer learning: we train machine learning models on easily accessible non-clinical datasets and test them on novel clinical data in a different language.
Methods
A Mixture-of-Experts machine learning model was trained to infer happy/sad emotional state using three publicly available emotional speech corpora in German and US English. We examined the model's ability to classify the presence of depression in Danish-speaking healthy controls (N = 42), patients with first-episode major depressive disorder (MDD) (N = 40), and the subset of the same patients who entered remission (N = 25), based on recorded clinical interviews. The model was evaluated on raw, de-noised, and speaker-diarized data.
Results
The model showed separation between healthy controls and depressed patients at the first visit, obtaining an AUC of 0.71. Further, speech from patients in remission was indistinguishable from that of the control group. Model predictions were stable throughout the interview, suggesting that 20-30 seconds of speech might be enough to accurately screen a patient. Background noise (but not speaker diarization) heavily impacted predictions.
Conclusion
A generalizable speech emotion recognition model can effectively reveal changes in speaker depressive states before and after remission in patients with MDD. Data collection settings and data cleaning are crucial when considering automated voice analysis for clinical purposes.
Significant outcomes
- Using a speech emotion recognition model trained on other languages, we predicted the presence of MDD with an AUC of 0.71.
- The speech emotion recognition model could accurately detect changes in voice after patients achieved remission from MDD.
- Preprocessing steps, particularly background noise removal, greatly influenced classification performance.
Limitations
- No data from non-remitters were available, meaning that changes to voice for that group could not be assessed.
- It is unclear how well the model would generalize beyond Germanic languages.
Data availability statement
Due to the nature of the data (autobiographical interviews in a clinical population), the recordings of the participants cannot be shared publicly. The aggregated model predictions and the code used to run the analyses are available at https://github.com/HLasse/SERDepressionDetection.
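To make the reported AUC of 0.71 easier to interpret, the following is a minimal sketch of how an AUC can be computed from per-speaker model scores. It is not taken from the authors' repository: the function name `roc_auc`, the toy labels and scores, and the assumption that each speaker receives a single "sad" probability are all illustrative. The sketch uses the pairwise (Mann-Whitney) formulation, in which the AUC is the probability that a randomly chosen patient receives a higher score than a randomly chosen control.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, NOT the study's predictions:
# label 1 = depressed patient, 0 = healthy control;
# score = assumed per-speaker probability of "sad" from the emotion model.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2, 0.8]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.2f}")  # prints "AUC = 0.75"
```

Under this reading, an AUC of 0.71 would mean that in roughly 71% of patient-control pairs the patient's speech is scored as sadder than the control's, while 0.5 corresponds to chance.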
Publisher
Cold Spring Harbor Laboratory
Cited by
3 articles.