Sentiment Analysis of Code-Mixed Telugu-English Data Leveraging Syllable and Word Embeddings

Author:

Rayala Upendar Rao 1, Seshadri Karthick 2, Sristy Nagesh Bhattu 2

Affiliation:

1. Department of Computer Science and Engineering, National Institute of Technology, Andhra Pradesh, India, and Rajiv Gandhi University of Knowledge Technologies, Andhra Pradesh, India

2. Department of Computer Science and Engineering, National Institute of Technology, Andhra Pradesh, India

Abstract

Learning the inherent meaning of a word in Natural Language Processing (NLP) has motivated researchers to represent a word at various levels of abstraction, namely character-level, morpheme-level, and subword-level vector representations. Syllable-Aware Word Embedding (SAWE) can effectively handle agglutinative and fusion-based NLP tasks. However, research attempts to assess SAWE on such extrinsic NLP tasks have been scanty, especially for low-resource languages in the context of code-mixing with English. This article proposes a model that learns SAWE to extract semantics at fine-grained subunits of a word, and the representative ability of the embeddings is assessed through sentiment analysis of code-mixed Telugu-English review corpora. Multilingual societies and advancements in communication technologies have led to the prolific usage of mixed data, which renders State-of-the-Art (SOTA) sentiment analysis models developed on monolingual data ineffective. Social media users in the Indian subcontinent tend to mix English and their respective native language (using the phonetic form of English) when expressing their opinions or sentiments. A code-mixing scenario allows users to borrow words from a foreign language, use shorthand notations, elongate vowels, and use words without following syntactic or grammatical rules, which makes sentiment analysis of code-mixed data challenging to perform. Deep neural architectures like Long Short-Term Memory and Gated Recurrent Unit networks have been shown to be effective in solving several NLP tasks, such as sequence labeling, named entity recognition, and machine translation. In this article, a framework to perform sentiment analysis on a code-mixed Telugu-English review corpus is implemented.
Both word embedding and SAWE are input to a unified deep neural network that contains a two-level Bidirectional Long Short-Term Memory/Gated Recurrent Unit network with Softmax as the output layer. The proposed model leverages the advantages of both word embedding and SAWE, which enables it to outperform existing SOTA code-mixed sentiment analysis models on a Telugu-English code-mixed dataset of the International Institute of Information Technology–Hyderabad and on a dataset curated by the authors. The improvement realized by the proposed model on these datasets is [3% increase in F1-score and 2% increase in accuracy] and [7% increase in F1-score and 5% increase in accuracy], respectively, in comparison with the best-performing SOTA model.
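
As a rough illustration of how a word embedding and a syllable-derived embedding might be combined into one input vector per token, the following is a minimal sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the naive vowel-based syllable splitter, the seeded random "embeddings" (nothing is trained), the embedding size, the three-class label set, and the mean pooling that stands in for the two-level bidirectional recurrent layers, which are omitted here. Only the concatenation of the two embeddings and the Softmax output follow the description above.

```python
import math
import random
import re

DIM = 4  # toy embedding size (assumption; not taken from the article)
LABELS = ["negative", "neutral", "positive"]  # assumed label set


def split_syllables(word):
    """Naive splitter for romanized text (hypothetical rule):
    each syllable is a run of non-vowels followed by a run of vowels;
    trailing consonants are attached to the last syllable."""
    word = word.lower()
    parts = re.findall(r"[^aeiou]*[aeiou]+", word)
    if not parts:
        return [word]
    parts[-1] += word[len("".join(parts)):]
    return parts


def toy_vector(key, dim=DIM):
    """Deterministic pseudo-embedding seeded by the string key.
    A stand-in for a learned embedding table."""
    rng = random.Random(key)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def syllable_embedding(word):
    """Compose a word-level vector by averaging its syllable vectors."""
    vecs = [toy_vector("syl:" + s) for s in split_syllables(word)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def token_features(word):
    """Concatenate the word vector with the syllable-derived vector."""
    return toy_vector("word:" + word.lower()) + syllable_embedding(word)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence):
    """Mean-pool token features (placeholder for the recurrent layers),
    apply an untrained linear layer, and return class probabilities."""
    feats = [token_features(tok) for tok in sentence.split()]
    pooled = [sum(col) / len(feats) for col in zip(*feats)]
    weights = [toy_vector("w:" + lab, dim=len(pooled)) for lab in LABELS]
    scores = [sum(w * x for w, x in zip(row, pooled)) for row in weights]
    return dict(zip(LABELS, softmax(scores)))


if __name__ == "__main__":
    print(split_syllables("bagundi"))
    print(classify("movie chala bagundi"))
```

Because the weights are untrained, the probabilities printed by `classify` carry no meaning; the sketch only shows the shape of the data flow, in which each token contributes a vector of size `2 * DIM` built from both representations.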

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

