Forward-backward Transliteration of Punjabi Gurmukhi Script Using N-gram Language Model

Published:2022-12-27 Issue:2 Volume:22 Page:1-24
ISSN:2375-4699
Container-title:ACM Transactions on Asian and Low-Resource Language Information Processing
language:en
Short-container-title:ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Goyal Kapil Dev¹, Abbas Muhammad Raihan², Goyal Vishal³, Saleem Yasir⁴

Affiliation:

1. S.B.A.S. Khalsa College, Sandaur, Malerkotla, Punjab, India

2. Faculty of Information Technology, University of Central Punjab, Lahore, Punjab, Pakistan

3. Department of Computer Science, Punjabi University, Patiala, Punjab, India

4. Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Punjab, Pakistan

Abstract

Transliterating the text of a language into a foreign script is called forward transliteration, and transliterating it back into the original script is called backward transliteration. In this work, we perform both forward and backward transliteration on Punjabi. We transliterate Punjabi person names from Gurmukhi script to English Roman script, and from English Roman script back to Gurmukhi script, using an n-gram language model. We used more than one million parallel entities of person names in Gurmukhi and Roman script as the training corpus. From this corpus we generated English-to-Punjabi and Punjabi-to-English n-gram databases. To get better results, we created n-grams that are as long as possible, ranging from bi-grams to 30-grams. Our n-gram database contains more than 10 million n-grams, and each n-gram has multiple mappings in the other script. The most challenging part of creating the n-gram databases is finding the mapping for a given n-gram within the parallel name entity. According to the orthography rules, the same combination of letters may be pronounced differently depending on its location in the word. Therefore, we categorized n-grams into starting, middle, and ending n-grams and used them accordingly in the transliteration process. The transliteration process works like merge sort: we start by searching the database for the longest possible n-gram and split the string recursively until a match is found. The transliterated strings are then merged back to form the final output. In English-to-Punjabi transliteration, we achieved 96% accuracy against the gold standard and 99.14% accuracy using minimum edit distance. In Punjabi-to-English transliteration, the results showed 96.85% and 99.35% accuracy for the gold standard and minimum edit distance, respectively.
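The split-and-merge lookup described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `NGRAM_DB`, `position_of`, and `transliterate` are hypothetical, the three database entries are hand-made toy data, the "whole" position label for an unsplit name is an added assumption (the abstract names only starting, middle, and ending n-grams), and splitting at the midpoint is assumed from the merge sort analogy.

```python
from typing import Dict, Tuple

# Toy, hand-made entries (hypothetical): (position, Roman n-gram) -> Gurmukhi.
# The real databases hold over 10 million n-grams with multiple mappings each.
NGRAM_DB: Dict[Tuple[str, str], str] = {
    ("start", "ram"): "ਰਾਮ",
    ("end", "das"): "ਦਾਸ",
    ("whole", "kaur"): "ਕੌਰ",
}


def position_of(is_first: bool, is_last: bool) -> str:
    """Classify a substring by where it sits in the original word."""
    if is_first and is_last:
        return "whole"  # assumption: an unsplit name gets its own category
    if is_first:
        return "start"
    if is_last:
        return "end"
    return "middle"


def transliterate(
    text: str,
    db: Dict[Tuple[str, str], str] = NGRAM_DB,
    is_first: bool = True,
    is_last: bool = True,
) -> str:
    """Look up the longest n-gram first; on a miss, split in half and recurse."""
    if not text:
        return ""
    hit = db.get((position_of(is_first, is_last), text))
    if hit is not None:
        return hit
    if len(text) == 1:
        return text  # no mapping known: pass the character through unchanged
    mid = len(text) // 2  # assumed midpoint split, as in merge sort
    left = transliterate(text[:mid], db, is_first, False)
    right = transliterate(text[mid:], db, False, is_last)
    return left + right  # merge step: concatenate the transliterated halves


if __name__ == "__main__":
    print(transliterate("ramdas"))
    print(transliterate("kaur"))
```

Here "ramdas" misses as a whole word, is split into "ram" (looked up as a starting n-gram) and "das" (looked up as an ending n-gram), and the two results are concatenated. A production system would also need to choose among the multiple candidate mappings stored for each n-gram, which this sketch omits.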

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3542924

References: 32 articles.

1. Nasreen Abdul Jaleel and Leah S. Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In Proceedings of the 12th International Conference on Information and Knowledge Management. ACM, 139–146.

2. Lorna Balkan. 1994. Test Suites: Some Issues in Their Use and Design. Citeseer.

3. Rule based transliteration scheme for English to Punjabi;Bhalla Deepti;Int. J. Nat. Lang. Comput.,2013

4. Machine transliteration using SVM and HMM

5. Transliteration for Resource-Scarce Languages

Cited by 2 articles.

1. A Survey of Advancements in Real-Time Sign Language Translators: Integration with IoT Technology;Technologies;2023-06-22

2. Automatic Transliteration of Polish and English Proper Nouns into Lithuanian;Information Technology and Control;2023-03-28