Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets

Published: 2023-06-22 Volume: 9 Page: e1312
ISSN: 2376-5992
Container-title: PeerJ Computer Science
Language: en

Author:

Hidayatullah Ahmad Fathan¹,², Apong Rosyzie Anna¹, Lai Daphne T.C.¹, Qazi Atika³

Affiliation:

1. School of Digital Science, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei

2. Department of Informatics, Universitas Islam Indonesia, Sleman, Yogyakarta, Indonesia

3. Centre for Lifelong Learning, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei

Abstract

Social media is now used at massive scale, and text that mixes languages is prevalent on these platforms. In linguistics, this phenomenon is known as code-mixing. Its prevalence raises various challenges for natural language processing (NLP), including language identification (LID). This study presents a word-level language identification model for code-mixed Indonesian, Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-Javanese-English language identification (IJELID). To ensure reliable dataset annotation, we provide full details of the data collection procedure and of how the annotation standards were constructed. We also discuss some challenges encountered during corpus creation. Then, we investigate several strategies for developing code-mixed language identification models, including fine-tuned BERT models, BLSTM-based models, and conditional random fields (CRF). Our results show that fine-tuned IndoBERTweet models identify languages better than the other techniques. We attribute this to BERT’s ability to capture each word’s context within the given text sequence. Finally, we show that sub-word language representation in BERT models can provide a reliable model for identifying languages in code-mixed texts.
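
The abstract combines word-level labels with sub-word representations, which requires some alignment between words and sub-word pieces. The sketch below shows one common way this is done, and it is not taken from the paper: the example sentence is invented, the tag names (ID, JV, EN) are assumed rather than the IJELID tagset, and toy_subword_tokenize is a hypothetical stand-in for a real sub-word tokenizer.

```python
from typing import List, Optional, Tuple

CONT = "##"  # continuation marker, following the WordPiece convention


def toy_subword_tokenize(word: str, size: int = 4) -> List[str]:
    """Hypothetical tokenizer: fixed-size character chunks of a non-empty word."""
    chunks = [word[i:i + size] for i in range(0, len(word), size)]
    return [chunks[0]] + [CONT + c for c in chunks[1:]]


def align_labels(
    words: List[str], labels: List[str]
) -> Tuple[List[str], List[Optional[str]]]:
    """Expand word-level labels to pieces; only the first piece keeps the label."""
    pieces: List[str] = []
    piece_labels: List[Optional[str]] = []
    for word, label in zip(words, labels):
        sub = toy_subword_tokenize(word)
        pieces.extend(sub)
        # Continuation pieces get None so a training loss could skip them.
        piece_labels.extend([label] + [None] * (len(sub) - 1))
    return pieces, piece_labels


def collapse_to_words(pieces: List[str], piece_preds: List[str]) -> List[str]:
    """Recover one prediction per word by reading the first piece of each word."""
    return [
        pred
        for piece, pred in zip(pieces, piece_preds)
        if not piece.startswith(CONT)
    ]


# Invented code-mixed example with assumed tags (not from the IJELID corpus).
words = ["saya", "sedang", "meeting", "karo", "kancaku"]
labels = ["ID", "ID", "EN", "JV", "JV"]

pieces, piece_labels = align_labels(words, labels)

# Hypothetical per-piece model output, one tag for every sub-word piece.
piece_preds = ["ID", "ID", "ID", "EN", "EN", "JV", "JV", "JV"]
word_preds = collapse_to_words(pieces, piece_preds)
```

Reading only the first piece of each word is one of several possible choices; a majority vote over a word's pieces would be another.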

Funder

Universiti Brunei Darussalam

Publisher

PeerJ

Subject

General Computer Science

Link

https://peerj.com/articles/cs-1312.pdf

Cited by 5 articles.

1. Leveraging Natural Language Processing for Enhanced Text Analysis in Business Intelligence;Advances in Computational Intelligence and Robotics;2024-08-30

2. Cloud-Based Offensive Code Mixed Text Classification Using Hierarchical Attention Network;Advances in Systems Analysis, Software Engineering, and High Performance Computing;2024-03-29

3. Special issue on analysis and mining of social media data;PeerJ Computer Science;2024-02-29

4. Sentiment Analysis in Low-Resource Settings: A Comprehensive Review of Approaches, Languages, and Data Sources;IEEE Access;2024

5. Data Augmentation Approach for Language Identification in Imbalanced Bilingual Code-Mixed Social Media Datasets;2023 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET);2023-09-12