Am I a Resource-Poor Language? Data Sets, Embeddings, Models and Analysis for four different NLP Tasks in Telugu Language

Authors:

Marreddy Mounika¹, Oota Subba Reddy², Vakada Lakshmi Sireesha², Chinni Venkata Charan², Mamidi Radhika²

Affiliations:

1. IIITH, Hyderabad, Telangana, India

2. IIITH, Hyderabad, India

Abstract

Due to the lack of a large annotated corpus, many resource-poor Indian languages struggle to reap the benefits of recent deep feature representations in Natural Language Processing (NLP). Moreover, adopting existing language models trained on large English corpora for Indian languages is often limited by data availability, rich morphological variation, syntax, and semantic differences. In this paper, we explore traditional to recent efficient representations to overcome the challenges of a low-resource language, Telugu. In particular, our main objective is to mitigate the low-resource problem for Telugu. Overall, we present several contributions for this resource-poor language: (i) a large annotated dataset (35,142 sentences per task) for multiple NLP tasks such as sentiment analysis, emotion identification, hate-speech detection, and sarcasm detection, (ii) lexicons for sentiment, emotion, and hate speech that improve the efficiency of the models, (iii) pretrained word and sentence embeddings, and (iv) several pretrained language models for Telugu, namely ELMo-Te, BERT-Te, RoBERTa-Te, ALBERT-Te, and DistilBERT-Te, trained on a large Telugu corpus of 8,015,588 sentences (1,637,408 sentences from Telugu Wikipedia and 6,378,180 sentences crawled from different Telugu websites). Further, we show that these representations significantly improve the performance of the four NLP tasks and present benchmark results for Telugu. We argue that our pretrained embeddings are competitive with or better than the existing multilingual pretrained models mBERT, XLM-R, and IndicBERT. Lastly, fine-tuning of the pretrained models yields higher performance than linear probing on the four NLP tasks, with the following F1-scores: Sentiment (68.72), Emotion (58.04), Hate-Speech (64.27), and Sarcasm (77.93). We also experiment on publicly available Telugu datasets (Named Entity Recognition, Article Genre Classification, and Sentiment Analysis) and find that our Telugu pretrained language models (BERT-Te and RoBERTa-Te) outperform the state-of-the-art systems except on the sentiment task. We open-source our corpus, four datasets, lexicons, embeddings, and code at https://github.com/Cha14ran/DREAM-T. The pretrained Transformer models for Telugu are available at https://huggingface.co/ltrctelugu.
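As a rough illustration of how the released checkpoints can be used, the sketch below loads one of the pretrained Telugu Transformer models from the Hugging Face hub and prepares it for fine-tuning as a sentence classifier (e.g. for the sentiment task). It is not code from the paper; the model identifier is a placeholder, and the actual repository names should be taken from https://huggingface.co/ltrctelugu.

    # Minimal sketch (not from the paper): load a pretrained Telugu Transformer
    # from the ltrctelugu organization on the Hugging Face hub and set it up for
    # a binary classification task such as sentiment analysis.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Placeholder identifier; use the actual repository name listed at
    # https://huggingface.co/ltrctelugu (e.g. the BERT-Te checkpoint).
    MODEL_ID = "ltrctelugu/bert-te"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID,
        num_labels=2,  # e.g. positive vs. negative sentiment
    )

    # Tokenize a Telugu sentence and obtain classification logits.
    inputs = tokenizer("ఈ సినిమా చాలా బాగుంది", return_tensors="pt")
    logits = model(**inputs).logits

From the logits, a standard fine-tuning loop (or linear probing, as compared in the paper) can be run on the released sentiment, emotion, hate-speech, or sarcasm datasets.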

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Cited by 2 articles.

1. User-aware multilingual abusive content detection in social media. Information Processing & Management, 2023-09.

2. Hate Speech and Stereotypes with Artificial Neural Networks. Computational Science and Its Applications – ICCSA 2022 Workshops, 2022.
