Affiliation:
1. Université Paris-Saclay, CNRS, LISN, 91400 Orsay, France. aina.gari@limsi.fr
2. Department of Digital Humanities, University of Helsinki, Helsinki, Finland. marianna.apidianaki@helsinki.fi
Abstract
Pre-trained language models (LMs) encode rich information about linguistic structure, but their knowledge about lexical polysemy remains unclear. We propose a novel experimental setup for analyzing this knowledge in LMs specifically trained for different languages (English, French, Spanish, and Greek) and in multilingual BERT. We perform our analysis on datasets carefully designed to reflect different sense distributions, and control for parameters that are highly correlated with polysemy, such as frequency and grammatical category. We demonstrate that BERT-derived representations reflect words’ polysemy level and their partitionability into senses. Polysemy-related information is more clearly present in English BERT embeddings, but models in other languages also manage to establish relevant distinctions between words at different polysemy levels. Our results contribute to a better understanding of the knowledge encoded in contextualized representations and open up new avenues for multilingual lexical semantics research.
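To make the abstract's notion of "partitionability into senses" concrete, the sketch below illustrates one simple way such a signal could be probed: collect a word's contextualized BERT vectors across sentences, cluster them, and use the silhouette score as a rough separability measure. This is a minimal illustration, not the paper's exact method; the model name, example sentences, cluster count, and the silhouette-based measure are all assumptions introduced here for exposition.

```python
# Minimal sketch (illustrative only): cluster a word's contextualized BERT
# vectors and use the silhouette score as a rough "partitionability" signal.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

MODEL_NAME = "bert-base-uncased"  # assumption: any BERT-style model could be probed this way
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def word_vectors(word, sentences):
    """Average the subword vectors of `word` in each sentence (last layer)."""
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    vecs = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
        ids = enc["input_ids"][0].tolist()
        # Locate the first occurrence of the word's subword span.
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i:i + len(word_ids)] == word_ids:
                vecs.append(hidden[i:i + len(word_ids)].mean(dim=0).numpy())
                break
    return vecs

# Hypothetical usage: occurrences of the polysemous word "bank".
sentences = [
    "She sat on the bank of the river.",
    "He deposited the cheque at the bank.",
    "The bank approved the loan yesterday.",
    "Wild flowers grew along the river bank.",
]
X = word_vectors("bank", sentences)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Higher silhouette ~ occurrences fall into better-separated groups (senses).
print("silhouette:", silhouette_score(X, labels))
```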
Cited by: 13 articles.