Sentiment Analysis of Code-Mixed Telugu-English Data Leveraging Syllable and Word Embeddings

Published: 2023-10-13 Issue: 10 Volume: 22 Page: 1-30
ISSN: 2375-4699
Container-title: ACM Transactions on Asian and Low-Resource Language Information Processing
Language: en
Short-container-title: ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Rayala Upendar Rao¹, Seshadri Karthick², Sristy Nagesh Bhattu²

Affiliation:

1. Department of Computer Science and Engineering, National Institute of Technology, Andhra Pradesh, India, and Rajiv Gandhi University of Knowledge Technologies, Andhra Pradesh, India

2. Department of Computer Science and Engineering, National Institute of Technology, Andhra Pradesh, India

Abstract

Learning the inherent meaning of a word in Natural Language Processing (NLP) has motivated researchers to represent a word at various levels of abstraction, namely character-level, morpheme-level, and subword-level vector representations. Syllable-Aware Word Embedding (SAWE) can effectively handle agglutinative and fusion-based NLP tasks. However, attempts to assess SAWE on such extrinsic NLP tasks have been scanty, especially for low-resource languages in the context of code-mixing with English. This article proposes a model that learns SAWE to extract semantics at fine-grained subunits of a word, and assesses the representative ability of the embeddings through sentiment analysis of code-mixed Telugu-English review corpora. Multilingual societies and advancements in communication technologies have led to the prolific use of code-mixed data, which renders State-of-the-Art (SOTA) sentiment analysis models developed on monolingual data ineffective. Social media users in the Indian subcontinent tend to mix English with their native language (written phonetically in English letters) when expressing their opinions or sentiments. Code-mixing allows users to borrow words from a foreign language, use shorthand notations, elongate vowels, and use words without following syntactic or grammatical rules, which makes sentiment analysis of code-mixed data challenging. Deep neural architectures like Long Short-Term Memory and Gated Recurrent Unit networks have been shown to be effective in solving several NLP tasks, such as sequence labeling, named entity recognition, and machine translation. In this article, a framework to perform sentiment analysis on a code-mixed Telugu-English review corpus is implemented. Both word embedding and SAWE are input to a unified deep neural network that contains a two-level Bidirectional Long Short-Term Memory/Gated Recurrent Unit network with Softmax as the output layer. The proposed model leverages the advantages of both word embedding and SAWE, which enables it to outperform existing SOTA code-mixed sentiment analysis models on a Telugu-English code-mixed dataset of the International Institute of Information Technology–Hyderabad and on a dataset curated by the authors. On these two datasets, the proposed model improves on the best-performing SOTA model by 3% in F1-score and 2% in accuracy, and by 7% in F1-score and 5% in accuracy, respectively.
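As a rough illustration of the idea of feeding both a word-level and a syllable-level representation into one classifier, the following Python sketch may help. It is not the authors' model: the toy vectors, the hand-supplied syllable splits, the averaging used to compose syllable vectors, and the function names are all assumptions made for illustration, and the two-level bidirectional recurrent layers described in the abstract are omitted, leaving only the input combination and the Softmax output.

```python
import math

# Hypothetical toy tables; real embeddings would be learned from a corpus.
WORD_VECS = {"movie": [0.2, -0.1], "bagundi": [0.7, 0.3]}
SYLLABLE_VECS = {
    "mo": [0.1, 0.0], "vie": [0.0, 0.2],
    "ba": [0.3, 0.1], "gun": [0.2, 0.4], "di": [0.1, 0.1],
}
UNK = [0.0, 0.0]  # fallback for out-of-vocabulary words and syllables


def syllable_aware_vector(syllables):
    """Average the syllable vectors (an assumed composition function)."""
    vecs = [SYLLABLE_VECS.get(s, UNK) for s in syllables]
    if not vecs:
        return list(UNK)
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def token_input(word, syllables):
    """Concatenate the word vector with the syllable-composed vector."""
    return WORD_VECS.get(word, UNK) + syllable_aware_vector(syllables)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights):
    """Linear scoring followed by softmax; one weight row per class."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)


# Usage: the syllable split and the class weights are supplied by hand here.
x = token_input("bagundi", ["ba", "gun", "di"])
probs = classify(x, [[1, 0, 1, 0], [0, 1, 0, 1], [-1, -1, -1, -1]])
```

In a full model along the lines the abstract describes, the per-token vectors would pass through the recurrent layers before the output layer, and the syllable composition would be learned rather than fixed to an average.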

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3620670
