Bilingual Corpus-based Hybrid POS Tagger for Low Resource Tamil Language: A Statistical approach

Published:2022-11-11 Issue:6 Volume:43 Page:8329-8348
ISSN:1064-1246
Container-title:Journal of Intelligent & Fuzzy Systems
Short-container-title:IFS

Author:

Senthamizh Selvi S.¹, Anitha R.¹

Affiliation:

1. Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, Tamil Nadu, India

Abstract

In India, most of the available science and technology resources are in English. An automatic translation engine from English (source language) to Tamil (target language) is essential for people who need technical resources in their native language. The challenges in designing such engines with Natural Language Processing (NLP) tools include lexical, structural, and syntax-level ambiguity. Addressing these challenges requires a Part-Of-Speech (POS) tagger. Verb-framed languages such as Tamil, Japanese, and many languages in the Romance, Semitic, and Mayan families are morphologically rich but lack either large annotated corpora or manually constructed linguistic resources for building a POS tagger. Moreover, Tamil is a low-resource language with high word-sense ambiguity and free word order, which makes designing Tamil POS taggers challenging. In this paper, we propose a Hybrid POS tagger algorithm for Tamil using Cross-Lingual Transformation Learning Techniques. It is a novel Mining-based algorithm (MT) that finds the English equivalents of Tamil words in a small English-Tamil bilingual unannotated parallel corpus. To enhance the performance of MT, we developed Tamil-specific auxiliary algorithms: a Keyword-based tagging algorithm (KT) and a Verb pattern-based tagging algorithm (VT). We also developed a Unique pair occurrence-tagging algorithm (UT) to find Tamil-English word pairs that occur only once. Our experiments show that, after improving the context-based bilingual corpus into a bilingual parallel corpus and leaving out one-time occurrence words, the proposed Hybrid POS tagger can predict 81.15% of words, with 73.51% accuracy and 90.50% precision. The evaluations show that our algorithms can generate language resources that can improve the performance of NLP tasks in Tamil.
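The general idea of mining word equivalents from an unannotated parallel corpus and projecting English tags onto Tamil words can be illustrated with a small sketch. This is not the paper's MT algorithm, whose details the abstract does not give. It is a minimal stand-in that scores word pairs by sentence-level co-occurrence using the Dice coefficient. The function names (`mine_alignments`, `project_tags`), the `min_count` threshold, the romanized Tamil tokens, and the toy English tag lexicon are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def mine_alignments(pairs, min_count=2):
    """Map each target word to its best co-occurring source word.

    Pairs are scored with the Dice coefficient over sentence-level
    co-occurrence (an assumed scoring choice, not the paper's method).
    Target words seen in fewer than `min_count` sentences are skipped,
    loosely mirroring the idea of leaving out one-time occurrences.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in pairs:
        src_set, tgt_set = set(src_sent.split()), set(tgt_sent.split())
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        joint.update(product(src_set, tgt_set))

    best = {}
    for (s, t), c in joint.items():
        if tgt_count[t] < min_count:
            continue
        # Tuple comparison breaks score ties by source word, so the
        # result does not depend on set iteration order.
        score = (2 * c / (src_count[s] + tgt_count[t]), s)
        if t not in best or score > best[t]:
            best[t] = score
    return {t: s for t, (_, s) in best.items()}


def project_tags(alignments, source_tags):
    """Copy the source word's POS tag onto the aligned target word."""
    return {t: source_tags[s] for t, s in alignments.items() if s in source_tags}


# Hypothetical toy corpus: (English, romanized Tamil) sentence pairs.
corpus = [
    ("he reads book", "avan puththagam padikkiraan"),
    ("he writes letter", "avan kaditham ezhuthukiraan"),
    ("boy reads letter", "paiyan kaditham padikkiraan"),
    ("boy writes book", "paiyan puththagam ezhuthukiraan"),
    ("girl sings", "penn paadukiraal"),  # both target words occur only once
]

# Hypothetical English tag lexicon; a real system would use an English tagger.
english_tags = {
    "he": "PRON", "boy": "NOUN", "girl": "NOUN",
    "book": "NOUN", "letter": "NOUN",
    "reads": "VERB", "writes": "VERB", "sings": "VERB",
}

alignments = mine_alignments(corpus)
tags = project_tags(alignments, english_tags)
for word in sorted(tags):
    print(word, alignments[word], tags[word])
```

In this toy corpus, words that occur only once are left untagged, so coverage stays below 100%. A hybrid design of the kind the abstract describes would hand such words to auxiliary rules instead.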

Publisher

IOS Press

Subject

Artificial Intelligence,General Engineering,Statistics and Probability

Reference: 12 articles.

1. Kernel based part of speech tagger for Kannada;Antony;International Conference on Machine Learning and Cybernetics, IEEE,2010

2. Learning character-level representations for part-of-speech tagging;Cicero Dos Santos;Proceedings of the 31st International Conference on Machine Learning, PMLR,2014

3. Tamil POS Tagging using Linear Programming;Dhanalakshmi V., Anand Kumar, Shivapratap G., Soman K.P., Rajendran S.;International Journal of Recent Trends in Engineering, 1(2),2009. https://www.semanticscholar.org/paper/Tamil-POS-Tagging-using-Linear-Programming-Dhanalakshmi-Kumar/18a4e319cb0093be3cb9cf9408ee55ad0fe2b44f

4. Statistical language learning;Eugene Charniak;Language,1997

5. Part-of-speech tagging with neural networks;Helmut Schmid;COLING '94: Proceedings of the 15th Conference on Computational Linguistics, 1,1994. https://doi.org/10.3115/991886.991915