Bilingual Corpus-based Hybrid POS Tagger for Low Resource Tamil Language: A Statistical approach

Author:

Senthamizh Selvi S.¹, Anitha R.¹

Affiliation:

1. Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, Tamil Nadu, India

Abstract

In India, most of the available science and technology resources are in English. Developing an automatic language translation engine from English (source language) to Tamil (target language) is essential for people who need technical resources in their native language. The challenges in designing such engines with Natural Language Processing (NLP) tools include lexical, structural, and syntax-level ambiguity. Addressing these challenges requires a Part-Of-Speech (POS) tagger. Verb-framed languages such as Tamil, Japanese, and many languages in the Romance, Semitic, and Mayan families are morphologically rich but lack either large annotated corpora or manually constructed linguistic resources for building a POS tagger. Moreover, Tamil is a low-resource language with high word-sense ambiguity and free word order, which makes designing Tamil POS taggers challenging. In this paper, we propose a hybrid POS tagger algorithm for Tamil using Cross-Lingual Transformation Learning Techniques. The tagger uses a novel mining-based algorithm (MT), which finds the English equivalents of Tamil words from a small English-Tamil bilingual unannotated parallel corpus. To enhance the performance of MT, we developed Tamil-specific auxiliary algorithms such as a keyword-based tagging algorithm (KT) and a verb pattern-based tagging algorithm (VT). We also developed a unique pair occurrence tagging algorithm (UT) to find Tamil-English word pairs that occur only once. Our experiments show that, after improving the context-based bilingual corpus to a bilingual parallel corpus and leaving out one-time occurrence words, the proposed hybrid POS tagger can predict tags for 81.15% of words, with 73.51% accuracy and 90.50% precision. The evaluations show that our algorithms can generate language resources that can improve the performance of NLP tasks in Tamil.
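The abstract does not give the details of the mining-based algorithm, so the following is only a minimal sketch of the general idea it describes: finding English equivalents of Tamil words in an unannotated parallel corpus and reusing the English POS tags. The Dice co-occurrence score, the `min_count` threshold, the function names, the romanized toy sentences, and the hand-written English tag lookup are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter

def mine_equivalents(parallel, min_count=2):
    """For each target word, pick the source word it co-occurs with most
    strongly (Dice coefficient over sentence-level co-occurrence).
    Pairs seen fewer than min_count times are ignored (an assumption)."""
    pair_counts, src_freq, tgt_freq = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in parallel:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_set)   # number of sentences containing each source word
        tgt_freq.update(tgt_set)   # number of sentences containing each target word
        for t in tgt_set:
            for s in src_set:
                pair_counts[(t, s)] += 1
    best = {}
    for (t, s), c in pair_counts.items():
        if c < min_count:
            continue
        dice = 2 * c / (src_freq[s] + tgt_freq[t])
        if t not in best or dice > best[t][1]:
            best[t] = (s, dice)
    return {t: s for t, (s, _) in best.items()}

def project_tags(equivalents, src_tags):
    """Copy the POS tag of each mined source word onto its target word."""
    return {t: src_tags[s] for t, s in equivalents.items() if s in src_tags}

# Hypothetical toy data: (English tokens, romanized Tamil tokens).
PARALLEL = [
    (["he", "came"], ["avan", "vandhaan"]),
    (["he", "read", "book"], ["avan", "puththagam", "padiththaan"]),
    (["she", "read", "book"], ["aval", "puththagam", "padiththaal"]),
    (["book", "came"], ["puththagam", "vandhadhu"]),
]
# Hypothetical English tags; in practice these would come from an English tagger.
EN_TAGS = {"he": "PRON", "she": "PRON", "book": "NOUN", "read": "VERB", "came": "VERB"}

equivalents = mine_equivalents(PARALLEL)
tags = project_tags(equivalents, EN_TAGS)
print(tags)
```

On this toy data only the two Tamil words that recur across sentences receive a tag. The inflected verb forms each occur once and stay untagged, which illustrates why the abstract pairs the mining step with separate verb-pattern and one-time-occurrence algorithms.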

Publisher

IOS Press

Subject

Artificial Intelligence, General Engineering, Statistics and Probability

