Abstract
Research on language and cognition relies extensively on psycholinguistic datasets or “norms”. These datasets contain judgments of lexical properties like concreteness and age of acquisition, and can be used to norm experimental stimuli, discover empirical relationships in the lexicon, and stress-test computational models. However, collecting human judgments at scale is both time-consuming and expensive. This issue of scale is compounded for multi-dimensional norms and those incorporating context. The current work asks whether large language models (LLMs) can be leveraged to augment the creation of large, psycholinguistic datasets in English. I use GPT-4 to collect multiple kinds of semantic judgments (e.g., word similarity, contextualized sensorimotor associations, iconicity) for English words and compare these judgments against the human “gold standard”. For each dataset, I find that GPT-4’s judgments are positively correlated with human judgments, in some cases rivaling or even exceeding the average inter-annotator agreement displayed by humans. I then identify several ways in which LLM-generated norms differ systematically from human-generated norms. I also perform several “substitution analyses”, which demonstrate that replacing human-generated norms with LLM-generated norms in a statistical model does not change the sign of parameter estimates (though in select cases, there are significant changes to their magnitude). I conclude by discussing the considerations and limitations associated with LLM-generated norms in general, including concerns of data contamination, the choice of LLM, external validity, construct validity, and data quality. Additionally, all of GPT-4’s judgments (over 30,000 in total) are made available online for further analysis.
Publisher
Springer Science and Business Media LLC
Cited by 2 articles.
1. A Scoping Review of Large Language Models: Architecture and Applications. 2024 4th International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), 2024-05-16.
2. Large Language Models and the Wisdom of Small Crowds. Open Mind, 2024.