Abstract
Word embeddings are increasingly being used as a tool to study word associations in specific corpora. However, it is unclear whether such embeddings reflect enduring properties of language or whether they are sensitive to inconsequential variations in the source documents. We find that nearest-neighbor distances are highly sensitive to small changes in the training corpus for a variety of algorithms. For all methods, the inclusion or exclusion of specific documents in the training corpus can result in substantial variation. We show that these effects are more prominent for smaller training corpora. We recommend that users never rely on a single embedding model for distance calculations, but rather average over multiple bootstrap samples, especially for small corpora.
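The bootstrap-averaging procedure the abstract recommends can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' released code: it assumes gensim's Word2Vec as the embedding algorithm, a corpus represented as a list of tokenized documents, and a hypothetical helper name `bootstrap_similarity`. It resamples whole documents with replacement, retrains a model per sample, and reports the mean and spread of a word-pair similarity.

```python
import random
import statistics
from gensim.models import Word2Vec

def bootstrap_similarity(documents, word_a, word_b, n_samples=20, seed=0):
    """Estimate word-pair cosine similarity by averaging over models
    trained on bootstrap samples of the corpus.

    documents: list of tokenized documents, where each document is a
    list of sentences and each sentence is a list of tokens.
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(n_samples):
        # Resample whole documents with replacement, then flatten to
        # the sentence lists that Word2Vec expects as training input.
        sample = rng.choices(documents, k=len(documents))
        sentences = [sent for doc in sample for sent in doc]
        # Fixing the algorithm's seed keeps the remaining variation
        # attributable to the corpus resampling itself.
        model = Word2Vec(sentences, vector_size=100, window=5,
                         min_count=5, workers=1, seed=seed)
        if word_a in model.wv and word_b in model.wv:
            sims.append(float(model.wv.similarity(word_a, word_b)))
    mean = statistics.mean(sims)
    sd = statistics.stdev(sims) if len(sims) > 1 else 0.0
    return mean, sd
```

Reporting the standard deviation alongside the mean makes the abstract's warning concrete: if `sd` is large relative to differences between word pairs, conclusions drawn from any single trained model would be unreliable.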
Subject
Artificial Intelligence, Computer Science Applications, Linguistics and Language, Human-Computer Interaction, Communication
Cited by
77 articles.
1. Using Large Language Models to Understand Leadership Perception and Expectation;2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW);2024-07-15
2. Bettercall: AI based legal assistant;2024 5th International Conference on Image Processing and Capsule Networks (ICIPCN);2024-07-03
3. TransDrift: Modeling Word-Embedding Drift using Transformer;Companion Proceedings of the ACM Web Conference 2024;2024-05-13
4. Understanding Public Perceptions of AI Conversational Agents: A Cross-Cultural Analysis;Proceedings of the CHI Conference on Human Factors in Computing Systems;2024-05-11
5. Not What it Used to Be: Characterizing Content and User-base Changes in Newly Created Online Communities;Proceedings of the CHI Conference on Human Factors in Computing Systems;2024-05-11