Forward-backward Transliteration of Punjabi Gurmukhi Script Using N-gram Language Model

Author:

Goyal Kapil Dev (1), Abbas Muhammad Raihan (2), Goyal Vishal (3), Saleem Yasir (4)

Affiliation:

1. S.B.A.S. Khalsa College, Sandaur, Malerkotla, Punjab, India

2. Faculty of Information Technology, University of Central Punjab, Lahore, Punjab, Pakistan

3. Department of Computer Science, Punjabi University, Patiala, Punjab, India

4. Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Punjab, Pakistan

Abstract

Transliterating text from a language into a foreign script is called forward transliteration, and transliterating it back into the original script is called backward transliteration. In this work, we perform both forward and backward transliteration on Punjabi. We transliterate Punjabi person names from Gurmukhi script to English Roman script, and from English Roman script back to Gurmukhi script, using an n-gram language model. We used more than one million parallel person-name entities in Gurmukhi and Roman script as the training corpus. From this corpus, we generated English-to-Punjabi and Punjabi-to-English n-gram databases. To get better results, we created n-grams that are as long as possible, ranging from bi-grams to 30-grams. Our n-gram database contains more than 10 million n-grams, each with multiple mappings in the other script. The most challenging part of creating the n-gram databases is finding the mapping for a given n-gram within the parallel name entity. Under the orthography rules, the same combination of letters may be pronounced differently depending on its location in the word. Therefore, we categorized n-grams into starting, middle, and ending n-grams and used them accordingly in the transliteration process. The transliteration process works like merge sort: we first search the database for the longest possible n-gram and split the string recursively until a match is found. The transliterated strings are then merged back to form the final output. In English-to-Punjabi transliteration, we achieved 96% accuracy against the gold standard and 99.14% accuracy using minimum edit distance. In Punjabi-to-English transliteration, the results showed 96.85% and 99.35% accuracy for the gold standard and minimum edit distance, respectively.
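The recursive split-and-merge lookup described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the names `TOY_DB`, `position_of`, and `transliterate` are hypothetical, the handful of Roman-to-Gurmukhi entries are toy examples rather than data from the paper's corpus, each n-gram stores a single mapping instead of multiple candidates, the split is always at the midpoint, and a fragment covering the whole word is looked up as a starting n-gram.

```python
# Toy database (hypothetical): (position, Roman n-gram) -> Gurmukhi string.
# The real databases are described as holding over 10 million n-grams,
# each with multiple candidate mappings; one mapping is kept here for brevity.
TOY_DB = {
    ("start", "kapil"): "ਕਪਿਲ",
    ("start", "dev"): "ਦੇਵ",
    ("end", "dev"): "ਦੇਵ",
    ("start", "raj"): "ਰਾਜ",
    ("end", "raj"): "ਰਾਜ",
}


def position_of(lo, hi, length):
    """Classify the fragment word[lo:hi] as a starting, ending, or middle n-gram.

    Assumption: a fragment that spans the whole word counts as "start".
    """
    if lo == 0:
        return "start"
    if hi == length:
        return "end"
    return "middle"


def transliterate(word, db, lo=0, hi=None):
    """Transliterate word[lo:hi] by longest match, splitting like merge sort."""
    if hi is None:
        hi = len(word)
    if lo >= hi:
        return ""
    fragment = word[lo:hi]
    # Try the longest fragment first, using its position category.
    hit = db.get((position_of(lo, hi, len(word)), fragment))
    if hit is not None:
        return hit
    # Base case: an unknown single letter is passed through unchanged.
    if hi - lo == 1:
        return fragment
    # No match: split in half, transliterate each half, then merge the results.
    mid = (lo + hi) // 2
    return transliterate(word, db, lo, mid) + transliterate(word, db, mid, hi)


if __name__ == "__main__":
    for name in ("kapil", "devraj", "rajdev"):
        print(name, "->", transliterate(name, TOY_DB))
```

In this sketch, `kapil` matches as a whole, while `devraj` has no whole-word entry and is split into a starting `dev` and an ending `raj` whose outputs are concatenated. A midpoint split is only one possible choice; the abstract does not state where the string is split or how ties among multiple mappings are resolved.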

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Cited by 2 articles.
