Improving Code-mixed POS Tagging Using Code-mixed Embeddings

Author:

Bhattu S. Nagesh (1), Nunna Satya Krishna (2), Somayajulu D. V. L. N. (3), Pradhan Binay (4)

Affiliation:

1. National Institute of Technology Andhra Pradesh, Andhra Pradesh, India

2. IDRBT and National Institute of Technology, Andhra Pradesh, India

3. National Institute of Technology and IIITDMKL, Warangal, Andhra Pradesh, India

4. International Institute of Information Technology, Odisha, India

Abstract

Social media data has become an invaluable component of business analytics. The many nuances of social media text make the job of conventional text analytics tools difficult. Code-mixing is a phenomenon prevalent among social media users, wherein words are borrowed from multiple languages but written in the commonly understood Roman script. Existing supervised learning methods for tasks such as part-of-speech (POS) tagging of code-mixed social media (CMSM) text typically depend on a large amount of training data. Preparing such training data is resource-intensive and requires expertise in multiple languages. Although preparing a small dataset is feasible, out-of-vocabulary (OOV) words pose a major difficulty when learning models from CMSM text, because the number of different ways of writing non-native words in Roman script is huge. POS tagging of code-mixed text is non-trivial, as the tagger must deal with the syntactic rules of multiple languages. The research question addressed by this article is whether abundantly available unlabeled data can help resolve the difficulties that code-mixed text poses for POS tagging. We develop an approach for scraping code-mixed text and building word embeddings from it, illustrated for Bengali-English, Hindi-English, and Telugu-English code-mixing scenarios. We use a hierarchical deep recurrent neural network with a linear-chain CRF layer on top to improve the performance of POS tagging on CMSM text by capturing contextual word features and character-sequence-based information. We prepared a labeled resource for POS tagging of CMSM text by correcting 19% of the labels in an existing resource. A detailed analysis of the performance of our approach at varying levels of code-mixing is provided. The results indicate that the F1-score of our approach with custom embeddings is better than the CRF-based baseline by 5.81%, 5.69%, and 6.3% for Bengali, Hindi, and Telugu, respectively.
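
The abstract mentions a linear-chain CRF layer placed on top of a recurrent network. As a minimal sketch of what such a layer does at prediction time, the Python function below runs Viterbi decoding over per-token tag scores. It assumes those scores have already been produced by some encoder; the names `viterbi_decode`, `emissions`, and `transitions`, and the toy scores in the usage note, are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag sequence under a linear-chain model.

    emissions[t][j]   : score of tag j at position t (assumed to come from an encoder)
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    # best[j] = best score of any path that ends in tag j at the current position
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best = []
        pointers = []
        for j in range(num_tags):
            # Choose the previous tag that maximises the path score into tag j.
            prev = max(range(num_tags), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Start from the best final tag and follow the back-pointers to the start.
    last = max(range(num_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path
```

With hypothetical emission scores `[[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]` and all-zero transitions, each token simply takes its own best tag. Adding a penalty for switching tags, for example `[[0.0, -5.0], [-5.0, 0.0]]`, makes the decoder prefer a consistent sequence, which is the kind of neighbouring-tag dependency a CRF layer can exploit in sequence labeling.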

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Cited by 8 articles.

1. Language augmentation approach for code-mixed text classification;Natural Language Processing Journal;2023-12

2. Sentiment Analysis of Code-Mixed Telugu-English Data Leveraging Syllable and Word Embeddings;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-10-13

3. The Effect of Stopword Removal on Information Retrieval for Code-Mixed Data Obtained Via Social Media;SN Computer Science;2023-06-27

4. Homograph Language Identification Using Machine Learning Techniques;Proceedings of International Conference on Data Science and Applications;2023

5. Character-level inclusive transformer architecture for information gain in low resource code-mixed language;Neural Computing and Applications;2022-03-09
