Evaluating large language models for health-related text classification tasks with public social media data

Authors:

Yuting Guo1, Anthony Ovadje2, Mohammed Ali Al-Garadi3, Abeed Sarker1,2 (ORCID)

Affiliation:

1. Department of Biomedical Informatics, Emory University, Atlanta, GA 30322, United States

2. Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States

3. Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37235, United States

Abstract

Objectives

Large language models (LLMs) have demonstrated remarkable success in natural language processing (NLP) tasks. This study aimed to evaluate their performance on social media-based, health-related text classification tasks.

Materials and Methods

We benchmarked 1 Support Vector Machine (SVM), 3 supervised pretrained language models (PLMs), and 2 LLM-based classifiers across 6 text classification tasks. We developed 3 approaches for leveraging LLMs: employing LLMs as zero-shot classifiers, using LLMs as data annotators, and utilizing LLMs with few-shot examples for data augmentation.

Results

Across all tasks, the mean (SD) F1 score differences for RoBERTa, BERTweet, and SocBERT trained on human-annotated data were 0.24 (±0.10), 0.25 (±0.11), and 0.23 (±0.11), respectively, compared to those trained on data annotated using GPT3.5, and were 0.16 (±0.07), 0.16 (±0.08), and 0.14 (±0.08), respectively, compared to those trained on data annotated using GPT4. The GPT3.5 and GPT4 zero-shot classifiers outperformed SVMs in a single task and in 5 out of 6 tasks, respectively. When leveraging LLMs for data augmentation, RoBERTa models trained on GPT4-augmented data demonstrated superior or comparable performance relative to those trained on human-annotated data alone.

Discussion

The results revealed that using only LLM-annotated data to train supervised classification models was ineffective. However, employing an LLM as a zero-shot classifier showed the potential to outperform traditional SVM models and achieved higher recall than the advanced transformer-based model RoBERTa. Additionally, our results indicated that data augmentation with GPT3.5 could harm model performance, whereas data augmentation with GPT4 improved model performance, showcasing the potential of LLMs to reduce the need for extensive training data.

Conclusions

By leveraging the data augmentation strategy, we can harness the power of LLMs to develop smaller, more effective domain-specific NLP models. Using LLM-annotated data without human guidance to train lightweight supervised classification models is an ineffective strategy. However, an LLM used as a zero-shot classifier shows promise for excluding false negatives and potentially reducing the human effort required for data annotation.
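To illustrate the zero-shot classification approach described above, the following is a minimal sketch assuming the OpenAI Python client (openai>=1.0); the prompt wording, label set, and model name are illustrative placeholders, not the exact configuration used in the study.

# Minimal zero-shot classification sketch; prompt, labels, and model name are
# illustrative assumptions, not the study's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["positive", "negative"]  # hypothetical binary label set

def zero_shot_classify(post: str, model: str = "gpt-4") -> str:
    """Ask the LLM to assign one of LABELS to a social media post."""
    prompt = (
        "Classify the following social media post as one of: "
        f"{', '.join(LABELS)}.\n"
        "Respond with the label only.\n\n"
        f"Post: {post}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for classification
    )
    answer = response.choices[0].message.content.strip().lower()
    # Fall back to the first label if the model returns unexpected text.
    return answer if answer in LABELS else LABELS[0]

if __name__ == "__main__":
    print(zero_shot_classify("I started taking this medication and feel dizzy."))

The same pattern extends to the annotation and data augmentation strategies: LLM outputs are either used directly as training labels or appended as synthetic examples to the human-annotated training set for the supervised PLMs.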

Funder

National Institute on Drug Abuse (NIDA), National Institutes of Health (NIH)

Publisher

Oxford University Press (OUP)
