SwitchNet: Learning to switch for word-level language identification in code-mixed social media text

Published:2021-06-03 Page:1-23
ISSN:1351-3249
Container-title:Natural Language Engineering
language:en
Short-container-title:Nat. Lang. Eng.

Author:

Sarma Neelakshi, Sanasam Singh Ranbir, Goswami Diganta

Abstract

Word-level language identification is an essential prerequisite for extracting useful information from code-mixed social media content. Previous studies in word-level language identification show two important observations. First, the local context is an important indicator of the language of a word when a word is valid in multiple languages. Second, considering the word in isolation from its context leads to more effective language classification when a word is borrowed or embedded into sentences of other languages. In this paper, we propose a framework for language identification that makes use of a dynamic switching mechanism for effective language classification of both words that are borrowed or embedded from other languages and words that are valid in multiple languages. For a given input, the proposed switching mechanism makes a dynamic decision to bias its prediction either towards the prediction obtained from the contextual information or towards that obtained from the word in isolation. In contrast to existing studies that rely upon large amounts of annotated data for robust performance in a multilingual environment, the proposed approach uses minimal annotated resources and no external resources, making it easily extensible to newer languages. Evaluation over a corpus of transliterated Facebook comments shows that the proposed approach outperforms its baseline counterparts: classification based on the contextual information, classification based on the word in isolation, and an ensemble of the two classifiers.
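
The switching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the switch is computed, so the sketch assumes a simple scalar gate in [0, 1] that blends two per-language probability distributions. The function names `switch_predictions` and `predict_language`, the label set, and all probability values are hypothetical. In a trained model the gate would presumably be learned per input; here it is passed in by hand.

```python
from typing import List, Sequence


def switch_predictions(
    p_context: Sequence[float], p_word: Sequence[float], gate: float
) -> List[float]:
    """Blend two probability distributions over languages.

    gate close to 1.0 biases the result towards the context-based prediction,
    gate close to 0.0 biases it towards the word-in-isolation prediction.
    (Assumed formulation: a convex combination, chosen for illustration only.)
    """
    if len(p_context) != len(p_word):
        raise ValueError("both distributions must cover the same labels")
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return [gate * c + (1.0 - gate) * w for c, w in zip(p_context, p_word)]


def predict_language(
    labels: Sequence[str],
    p_context: Sequence[float],
    p_word: Sequence[float],
    gate: float,
) -> str:
    """Return the label with the highest blended probability."""
    mixed = switch_predictions(p_context, p_word, gate)
    best = max(range(len(mixed)), key=mixed.__getitem__)
    return labels[best]


# Hypothetical example: the context suggests Hindi, the word alone suggests English.
labels = ["en", "hi"]
p_context = [0.2, 0.8]
p_word = [0.9, 0.1]

# A word valid in several languages: trust the context.
print(predict_language(labels, p_context, p_word, gate=0.9))
# A borrowed or embedded word: trust the word in isolation.
print(predict_language(labels, p_context, p_word, gate=0.1))
```

With a gate near 1 the blended prediction follows the contextual classifier, and with a gate near 0 it follows the isolated-word classifier, which mirrors the two observations the abstract cites. A fixed gate of 0.5 would reduce to a plain average of the two classifiers, roughly the ensemble baseline the abstract compares against.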

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software

References: 46 articles.

1. Subword-Level Language Identification for Intra-Word Code-Switching

2. Nguyen, D. and Doğruöz, A.S. (2013) Word level language identification in online multilingual communication. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Washington, USA, pp. 857–862.

3. Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota, pp. 4171–4186.

4. Overview for the First Shared Task on Language Identification in Code-Switched Data

5. Gated Word-Character Recurrent Language Model

Cited by 7 articles.

1. A Framework to Find Single Language Version Using Pattern Analysis in Mixed Script Queries;2024 2nd International Conference on Disruptive Technologies (ICDT);2024-03-15

2. Features and Methods;Synthesis Lectures on Human Language Technologies;2024

3. A Comprehensive Survey of Techniques Used for Part-of-Speech Tagging of Code-Mixed Social Media Text;2023-08-23

4. Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets;PeerJ Computer Science;2023-06-22

5. Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers;International Journal of Computers and Applications;2023-05-04