Affiliation:
1. Department of Biomedical Informatics, Emory University , Atlanta, GA 30322, United States
2. Department of Biomedical Engineering, Georgia Institute of Technology , Atlanta, GA 30332, United States
3. Department of Biomedical Informatics, Vanderbilt University , Nashville, TN 37235, United States
Abstract
Abstract
Objectives
Large language models (LLMs) have demonstrated remarkable success in natural language processing (NLP) tasks. This study aimed to evaluate their performances on social media-based health-related text classification tasks.
Materials and Methods
We benchmarked 1 Support Vector Machine (SVM), 3 supervised pretrained language models (PLMs), and 2 LLMs-based classifiers across 6 text classification tasks. We developed 3 approaches for leveraging LLMs: employing LLMs as zero-shot classifiers, using LLMs as data annotators, and utilizing LLMs with few-shot examples for data augmentation.
Results
Across all tasks, the mean (SD) F1 score differences for RoBERTa, BERTweet, and SocBERT trained on human-annotated data were 0.24 (±0.10), 0.25 (±0.11), and 0.23 (±0.11), respectively, compared to those trained on the data annotated using GPT3.5, and were 0.16 (±0.07), 0.16 (±0.08), and 0.14 (±0.08) using GPT4, respectively. The GPT3.5 and GPT4 zero-shot classifiers outperformed SVMs in a single task and in 5 out of 6 tasks, respectively. When leveraging LLMs for data augmentation, the RoBERTa models trained on GPT4-augmented data demonstrated superior or comparable performance compared to those trained on human-annotated data alone.
Discussion
The results revealed that using LLM-annotated data only for training supervised classification models was ineffective. However, employing the LLM as a zero-shot classifier exhibited the potential to outperform traditional SVM models and achieved a higher recall than the advanced transformer-based model RoBERTa. Additionally, our results indicated that utilizing GPT3.5 for data augmentation could potentially harm model performance. In contrast, data augmentation with GPT4 demonstrated improved model performances, showcasing the potential of LLMs in reducing the need for extensive training data.
Conclusions
By leveraging the data augmentation strategy, we can harness the power of LLMs to develop smaller, more effective domain-specific NLP models. Using LLM-annotated data without human guidance for training lightweight supervised classification models is an ineffective strategy. However, LLM, as a zero-shot classifier, shows promise in excluding false negatives and potentially reducing the human effort required for data annotation.
Funder
National Institute on Drug Abuse
NIDA
National Institutes of Health
NIH
Publisher
Oxford University Press (OUP)