Abstract
Text embedding models from Natural Language Processing can map text data (e.g. words, sentences, documents) to meaningful numerical representations (a.k.a. text embeddings). While such models are increasingly applied in social science research, one important issue is often not addressed: the extent to which these embeddings are high-quality representations of the information they are intended to encode. We view this quality evaluation problem from a measurement validity perspective, and propose the use of the classic construct validity framework to evaluate the quality of text embeddings. First, we describe how this framework can be adapted to the opaque and high-dimensional nature of text embeddings. Second, we apply our adapted framework to an example in which we compare the validity of survey question representations across text embedding models.
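As a minimal illustration of the kind of comparison the abstract describes, the sketch below computes the cosine similarity between two embedding vectors, a common way to compare how a model represents two survey questions. The 3-dimensional vectors here are purely hypothetical toy values (real embedding models such as Sentence-BERT produce vectors with hundreds of dimensions); the function itself is standard.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings for two survey questions.
q1 = np.array([1.0, 0.0, 1.0])
q2 = np.array([1.0, 1.0, 0.0])

sim = cosine_similarity(q1, q2)
print(sim)  # 0.5 for these toy vectors
```

In a model-comparison setting, one would embed the same set of survey questions with each candidate model and examine whether the resulting similarity structure matches theoretical expectations about the constructs the questions measure.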
Funder
Nederlandse Organisatie voor Wetenschappelijk Onderzoek
Publisher
Springer Science and Business Media LLC
Subject
Computational Mathematics,Computer Science Applications,Modeling and Simulation
Cited by: 5 articles.