Multi-SimLex: A Large-Scale Evaluation of Multilingual and Crosslingual Lexical Semantic Similarity

Published:2021-02 Issue:4 Volume:46 Page:847-897
ISSN:0891-2017
Container-title:Computational Linguistics
language:en
Short-container-title:Computational Linguistics

Author:

Vulić Ivan¹,Baker Simon¹,Ponti Edoardo Maria¹,Petti Ulla¹,Leviant Ira²,Wing Kelly¹,Majewska Olga¹,Bar Eden²,Malone Matt¹,Poibeau Thierry³,Reichart Roi²,Korhonen Anna¹

Affiliation:

1. Language Technology Lab, University of Cambridge.

2. Faculty of Industrial Engineering and Management, Technion, IIT.

3. LATTICE Lab, CNRS and ENS/PSL and Univ. Sorbonne Nouvelle.

Abstract

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, because the concepts are aligned across languages, we provide a suite of 66 crosslingual semantic similarity data sets, one for each pair of languages. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses that can help guide future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
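
The figure of 66 crosslingual data sets in the abstract is consistent with one data set per unordered pair of the 12 languages. The minimal sketch below only checks that arithmetic. The language codes are hypothetical placeholders, since the abstract does not list all 12 languages, and nothing here reflects the authors' actual tooling.

```python
from itertools import combinations

# Hypothetical placeholder codes: the abstract names only some of the 12 languages.
languages = [f"lang{i:02d}" for i in range(1, 13)]

# One crosslingual data set per unordered pair of distinct languages.
crosslingual_pairs = list(combinations(languages, 2))

print(len(languages))           # number of monolingual data sets
print(len(crosslingual_pairs))  # number of crosslingual data sets: 12 * 11 / 2
```

With 12 languages this enumerates 12 × 11 / 2 = 66 pairs, which matches the number of crosslingual data sets reported in the abstract.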

Publisher

MIT Press - Journals

Subject

Artificial Intelligence,Computer Science Applications,Linguistics and Language,Language and Linguistics

Link

https://www.mitpressjournals.org/doi/pdf/10.1162/coli_a_00391

Cited by 21 articles.

1. Extracting intersectional stereotypes from embeddings: Developing and validating the Flexible Intersectional Stereotype Extraction procedure;PNAS Nexus;2024-02-29

2. A Three Layer Chinese Sentiment Polarity Detection Framework with Case Study;Communications in Computer and Information Science;2024

3. On the Independence of Association Bias and Empirical Fairness in Language Models;2023 ACM Conference on Fairness, Accountability, and Transparency;2023-06-12

4. Assessing linguistic generalisation in language models: a dataset for Brazilian Portuguese;Language Resources and Evaluation;2023-06-02

5. Curating and extending data for language comparison in Concepticon and NoRaRe;Open Research Europe;2023-05-24