Large Language Models for Epidemiological Research via Automated Machine Learning: Case Study Using Data From the British National Child Development Study

Published: 2023-09-19
Volume: 11
Page: e43638
ISSN: 2291-9694
Container-title: JMIR Medical Informatics
Language: en
Short-container-title: JMIR Med Inform

Author:

Wibaek Rasmus, Andersen Gregers Stig, Dahm Christina C, Witte Daniel R, Hulman Adam

Abstract

Background: Large language models have had a huge impact on natural language processing (NLP) in recent years. However, their application in epidemiological research is still limited to the analysis of electronic health records and social media data.

Objectives: To demonstrate the potential of NLP beyond these domains, we aimed to develop prediction models based on texts collected from an epidemiological cohort and compare their performance to classical regression methods.

Methods: We used data from the British National Child Development Study, where 10,567 children aged 11 years wrote essays about how they imagined themselves as 25-year-olds. Overall, 15% of the data set was set aside as a test set for performance evaluation. Pretrained language models were fine-tuned using AutoTrain (Hugging Face) to predict current reading comprehension score (range: 0-35) and future BMI and physical activity (active vs inactive) at the age of 33 years. We then compared their predictive performance (accuracy or discrimination) with linear and logistic regression models, including demographic and lifestyle factors of the parents and children from birth to the age of 11 years as predictors.

Results: NLP clearly outperformed linear regression when predicting reading comprehension scores (root mean square error: 3.89, 95% CI 3.74-4.05 for NLP vs 4.14, 95% CI 3.98-4.30 and 5.41, 95% CI 5.23-5.58 for regression models with and without general ability score as a predictor, respectively). Predictive performance for physical activity was similarly poor for the 2 methods (area under the receiver operating characteristic curve: 0.55, 95% CI 0.52-0.60 for both) but was slightly better than random assignment, whereas linear regression clearly outperformed the NLP approach when predicting BMI (root mean square error: 4.38, 95% CI 4.02-4.74 for NLP vs 3.85, 95% CI 3.54-4.16 for regression). The NLP approach did not perform better than simply assigning the mean BMI from the training set as a predictor.

Conclusions: Our study demonstrated the potential of using large language models on text collected from epidemiological studies. The performance of the approach appeared to depend on how directly the topic of the text was related to the outcome. Open-ended questions specifically designed to capture certain health concepts and lived experiences in combination with NLP methods should receive more attention in future epidemiological studies.
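
The evaluation design named in the abstract (a 15% held-out test set, root mean square error, and a baseline that assigns the training-set mean) can be illustrated with a minimal Python sketch. This is not the authors' code: the data are synthetic, the function names are hypothetical, and the percentile bootstrap used for the 95% CI is an assumption, since the abstract does not state how the intervals were computed.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def split_train_test(values, test_fraction=0.15, seed=0):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def bootstrap_ci(y_true, y_pred, n_boot=200, seed=0):
    """Percentile bootstrap 95% CI for RMSE (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        stats.append(rmse([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Synthetic stand-in for the outcome (NOT study data): 1,000 made-up BMI values.
data_rng = random.Random(42)
bmi = [data_rng.gauss(25.0, 4.0) for _ in range(1000)]

train, test = split_train_test(bmi, test_fraction=0.15, seed=1)

# Mean baseline: every test participant gets the training-set mean as prediction.
train_mean = sum(train) / len(train)
baseline_pred = [train_mean] * len(test)

baseline_rmse = rmse(test, baseline_pred)
ci_low, ci_high = bootstrap_ci(test, baseline_pred)
print(f"Mean-baseline RMSE: {baseline_rmse:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

Under this setup, any model whose test-set RMSE is not below the mean-baseline RMSE has learned nothing useful about the outcome, which is the comparison the Results draw for the BMI prediction.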

Publisher

JMIR Publications Inc.

Subject

Health Information Management,Health Informatics

References (35 articles)

1. Nadkarni. Natural language processing: an introduction. J Am Med Inform Assoc.

2. Bengio. A neural probabilistic language model. J Mach Learn Res.

3. Chen. Essential elements of natural language processing: what the radiologist should know. Acad Radiol.

4. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Presented at: 31st International Conference on Neural Information Processing Systems; June 12, 2017; Long Beach, CA. p. 6000-6010. [doi: 10.5555/3295222.3295349]

5. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. Presented at: 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT; June 2-7, 2019; Minneapolis, MN. p. 4171-4186. [doi: 10.18653/v1/N19-1423]