Abstract
This paper introduces a novel collection of word embeddings, numerical representations of lexical semantics, in 55 languages, trained on a large corpus of pseudo-conversational speech transcriptions from television shows and movies. The embeddings were trained on the OpenSubtitles corpus using the fastText implementation of the skipgram algorithm. Performance comparable with (and in some cases exceeding) that of embeddings trained on non-conversational (Wikipedia) text is reported on standard benchmark evaluation datasets. A novel evaluation method of particular relevance to psycholinguists is also introduced: prediction of experimental lexical norms in multiple languages. The models, as well as code for reproducing the models and all analyses reported in this paper (implemented as a user-friendly Python package), are freely available at: https://github.com/jvparidon/subs2vec.
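To make the lexical norm evaluation concrete, the following is a minimal sketch of how norm ratings might be predicted from word vectors and scored by cross-validated correlation. It does not use the subs2vec package or its API. The choice of ridge regression, the five-fold split, the function names, and the synthetic vectors and ratings are all assumptions made for illustration and are not taken from the abstract.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is the last weight."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0  # leave the intercept unpenalized
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def ridge_predict(X, w):
    """Apply fitted weights (including intercept) to new vectors."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w


def cross_validated_correlation(X, y, n_folds=5, alpha=1.0, seed=0):
    """Pearson r between ratings and out-of-fold predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    predictions = np.empty(len(y))
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], alpha)
        predictions[test_idx] = ridge_predict(X[test_idx], w)
    return float(np.corrcoef(y, predictions)[0, 1])


# Synthetic stand-ins (assumption): 200 "words" with 20-dimensional vectors
# and a rating that depends linearly on the vector plus noise.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(200, 20))
true_w = rng.normal(size=20)
norms = vectors @ true_w + rng.normal(scale=0.5, size=200)

score = cross_validated_correlation(vectors, norms)
print(f"cross-validated Pearson r: {score:.2f}")
```

With real data, `vectors` would hold the embedding of each rated word and `norms` the corresponding human ratings (for example concreteness or age of acquisition); words missing from the embedding vocabulary would need to be handled separately.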
Publisher
Springer Science and Business Media LLC
Subject
General Psychology, Psychology (miscellaneous), Arts and Humanities (miscellaneous), Developmental and Educational Psychology, Experimental and Cognitive Psychology
Cited by
11 articles.