English–Welsh Cross-Lingual Embeddings

Published:2021-07-16 Issue:14 Volume:11 Page:6541
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Espinosa-Anke Luis, Palmer Geraint, Corcoran Padraig, Filimonov Maxim, Spasić Irena, Knight Dawn

Abstract

Cross-lingual embeddings are vector space representations in which word translations tend to be co-located. These representations enable transfer learning across languages, thus bridging the gap between data-rich languages such as English and less-resourced ones. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 million words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task: a word vector space is first learned independently on each monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn monolingual embeddings: word2vec and fastText. Three cross-language alignment strategies were explored: cosine similarity, inverted softmax, and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can increase by approximately 20 percentage points.
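The CSLS retrieval step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the two embedding sets have already been mapped into a shared space, uses the commonly cited CSLS definition (twice the cosine similarity minus each word's mean similarity to its k nearest cross-lingual neighbours), and runs on made-up two-dimensional toy vectors. The function names `csls_scores` and `induce_dictionary` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_top_k(values, k):
    """Mean of the k largest values (fewer if the list is shorter)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def csls_scores(src_vecs, tgt_vecs, k=1):
    """CSLS score matrix: 2*cos(x, y) - r_tgt(x) - r_src(y).

    r_tgt(x) is the mean cosine similarity of source vector x to its
    k nearest target vectors; r_src(y) is the same for target vector y
    with respect to the source vectors.
    """
    sims = [[cosine(x, y) for y in tgt_vecs] for x in src_vecs]
    r_of_src = [mean_top_k(row, k) for row in sims]
    r_of_tgt = [
        mean_top_k([sims[i][j] for i in range(len(src_vecs))], k)
        for j in range(len(tgt_vecs))
    ]
    return [
        [2 * sims[i][j] - r_of_src[i] - r_of_tgt[j]
         for j in range(len(tgt_vecs))]
        for i in range(len(src_vecs))
    ]


def induce_dictionary(src_words, src_vecs, tgt_words, tgt_vecs, k=1):
    """Map each source word to the target word with the highest CSLS score."""
    scores = csls_scores(src_vecs, tgt_vecs, k)
    return {
        word: tgt_words[max(range(len(tgt_words)), key=lambda j: row[j])]
        for word, row in zip(src_words, scores)
    }


# Toy, hand-made vectors (an assumption for illustration only).
english_words = ["dog", "cat"]
english_vecs = [[1.0, 0.1], [0.1, 1.0]]
welsh_words = ["ci", "cath"]
welsh_vecs = [[0.9, 0.2], [0.2, 0.9]]

induced = induce_dictionary(english_words, english_vecs, welsh_words, welsh_vecs)
print(induced)
```

Subtracting the neighbourhood terms penalises "hub" vectors that are close to many words at once, which is the usual motivation for preferring CSLS over plain cosine nearest-neighbour retrieval in bilingual dictionary induction.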

Funder

Llywodraeth Cymru

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/11/14/6541/pdf

Reference: 56 articles.

Cited by 5 articles.

1. FreeTxt: A corpus-based bilingual free-text survey and questionnaire data analysis toolkit; Applied Corpus Linguistics; 2024-12

2. Linguistic expression of place appreciation in English and Welsh; Journal of Spatial Information Science; 2022-06-20

3. Current Approaches and Applications in Natural Language Processing; Applied Sciences; 2022-05-11

4. Supervised Bilingual Word Embeddings for Low-Resource Language Pairs: Myanmar and Thai; 2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP); 2021-12-21

5. Creating Welsh Language Word Embeddings; Applied Sciences; 2021-07-27