Loanword Identification in Low-Resource Languages with Minimal Supervision

Published:2020-05-31 Issue:3 Volume:19 Page:1-22
ISSN:2375-4699
Container-title:ACM Transactions on Asian and Low-Resource Language Information Processing
language:en
Short-container-title:ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Mi Chenggang¹, Xie Lei¹, Zhang Yanning¹

Affiliation:

1. Northwestern Polytechnical University, Shaanxi, China

Abstract

Bilingual resources play an important role in many natural language processing (NLP) tasks, especially in cross-lingual scenarios. However, building such resources is expensive and time consuming. Lexical borrowing happens in almost every language. This inspires us to detect loanwords effectively and to use “loanword (in the recipient language)”-“donor word (in the donor language)” pairs to extend the bilingual resources available for NLP tasks in low-resource languages. In this article, we propose a novel method to identify loanwords in Uyghur. The most important advantage of this method is that the model relies only on large monolingual corpora and a small amount of annotated data. Our loanword identification model has two parts: loanword candidate generation and loanword prediction. In the first part, we use two large-scale monolingual corpora and a small bilingual dictionary to train a cross-lingual embedding model. Because semantically unrelated words are unlikely to form loanword pairs, we generate a loanword candidate list from this model and a Uyghur word list. In the second part, we predict loanwords from these candidates with a log-linear model that integrates several features, such as pronunciation similarity, part-of-speech tags, and hybrid language modeling. To evaluate the effectiveness of the proposed method, we conduct two types of experiments: loanword identification and out-of-vocabulary (OOV) translation. Experimental results show that (1) the proposed method achieves significant F1 improvements over other models in all four loanword identification tasks in Uyghur, and (2) after the existing translation models are extended with the loanword identification results, OOV rates in several language pairs are reduced significantly and translation performance improves.
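The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the cosine threshold, the use of edit distance over romanized strings as a stand-in for pronunciation similarity, the softmax normalization over donor candidates, and the toy words and vectors are all assumptions made for illustration. The hybrid language modeling feature is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors from a shared cross-lingual space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def generate_candidates(recipient_vecs, donor_vecs, threshold=0.5):
    """Stage 1 (assumed form): keep only semantically close word pairs."""
    return [
        (r_word, d_word)
        for r_word, r_vec in recipient_vecs.items()
        for d_word, d_vec in donor_vecs.items()
        if cosine(r_vec, d_vec) >= threshold
    ]

def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pronunciation_similarity(a, b):
    """Hypothetical proxy: normalized edit similarity of romanized forms."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0

def rank_donors(recipient, donors, pos_tags, weights):
    """Stage 2 (assumed form): log-linear scores exp(w . f), normalized over donors."""
    scores = {}
    for donor in donors:
        same_pos = pos_tags.get(recipient) == pos_tags.get(donor)
        features = {
            "pron_sim": pronunciation_similarity(recipient, donor),
            "pos_match": 1.0 if same_pos else 0.0,
        }
        scores[donor] = math.exp(sum(weights[k] * v for k, v in features.items()))
    total = sum(scores.values())
    return {donor: s / total for donor, s in scores.items()}

# Toy data, invented for this sketch (not taken from the paper).
recipient_vecs = {"kompyuter": [1.0, 0.0]}
donor_vecs = {"computer": [0.9, 0.1], "banana": [0.0, 1.0]}
pos_tags = {"kompyuter": "NOUN", "computer": "NOUN", "banana": "NOUN"}
weights = {"pron_sim": 2.0, "pos_match": 1.0}

candidates = generate_candidates(recipient_vecs, donor_vecs)
probs = rank_donors("kompyuter", ["computer", "banana"], pos_tags, weights)
```

In this sketch the embedding filter removes the semantically unrelated pair before any scoring happens, and the log-linear stage then ranks the remaining donor words. In practice the feature weights would be learned from the small annotated set rather than set by hand.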

Funder

National Natural Science Foundation of China

National Key Research and Development Program of China

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3374212

References: 48 articles.

1. A Comparison of Character Neural Language Model and Bootstrapping for Language Identification in Multilingual Noisy Texts

2. Antonio Barone and Valerio Miceli. 2016. Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. arXiv:1608.02996.

3. Learning Crosslingual Word Embeddings without Bilingual Corpora

Cited by 5 articles.

1. Loanword identification based on web resources: A case study on wikipedia;Computer Speech & Language;2023-06

2. Improving the Robustness of Loanword Identification in Social Media Texts;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-03-24

3. The appeal of green advertisements on consumers' consumption intention based on low-resource machine translation;The Journal of Supercomputing;2022-10-12

4. Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion;Computational Intelligence and Neuroscience;2021-04-08

5. Methods and Trends of Machine Reading Comprehension in the Arabic Language;Computación y Sistemas;2020-12-09