Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language

Published: 2024-01-31 Volume: 10 Page: e1704
ISSN: 2376-5992
Container-title: PeerJ Computer Science
Language: en

Author:

Nazir Shahzad¹, Asif Muhammad¹, Rehman Mariam², Ahmad Shahbaz¹

Affiliation:

1. Department of Computer Science, National Textile University, Faisalabad, Pakistan

2. Department of Information Technology, Government College University, Faisalabad, Faisalabad, Pakistan

Abstract

In text applications, pre-processing is a significant factor in improving the outcomes of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal pre-processing procedures whose importance cannot be overstated. Text normalization transforms raw text into a standardized script form, while word tokenization splits the text into tokens or words. Well-defined normalization and tokenization approaches exist for most of the world's spoken languages. However, Urdu, the world's 10th most widely spoken language, has been overlooked by the research community. This research presents improved text normalization and tokenization techniques for the Urdu language. For Urdu text normalization, multiple regular expressions and rules are proposed, including removing diacritics, normalizing single characters, and separating digits. For word tokenization, core features are defined and extracted for each character of the text. A machine learning model is combined with specified handcrafted rules to predict spaces and tokenize the text. The experiment is performed while creating the largest human-annotated dataset composed in Urdu script, covering five different domains. The results are evaluated using precision, recall, F-measure, and accuracy, and are compared with the state of the art. The normalization approach produced a 20% improvement and the tokenization approach a 6% improvement.
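
The rule-based normalization steps named in the abstract (diacritic removal, single-character normalization, digit separation) could look roughly like the Python sketch below. It is illustrative only and is not the authors' implementation: the function name `normalize_urdu`, the chosen diacritic range, and the two-entry character map are assumptions made for this example, and the paper's actual rule set is not reproduced here.

```python
import re

# Assumption: treat the Arabic-script combining marks U+064B..U+065F and
# U+0670 (superscript alef) as the diacritics to strip.
DIACRITICS = re.compile(r"[\u064B-\u065F\u0670]")

# Hypothetical character map: Arabic code points replaced by the forms
# commonly used in Urdu text. A real rule set would be much larger.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

# Zero-width boundaries between a digit and a letter, in either order.
# [^\W\d_] matches any Unicode letter; \d also matches Urdu digits.
DIGIT_THEN_LETTER = re.compile(r"(?<=\d)(?=[^\W\d_])")
LETTER_THEN_DIGIT = re.compile(r"(?<=[^\W\d_])(?=\d)")


def normalize_urdu(text: str) -> str:
    """Apply the three illustrative normalization rules in sequence."""
    text = DIACRITICS.sub("", text)           # 1. remove diacritics
    text = text.translate(CHAR_MAP)           # 2. normalize single characters
    text = DIGIT_THEN_LETTER.sub(" ", text)   # 3. separate digits from letters
    text = LETTER_THEN_DIGIT.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```

Diacritics are stripped first so that a combining mark sitting between a letter and a digit does not hide the boundary from the digit-separation rules. Punctuation inside numbers, such as a decimal point, is left untouched because only letter-digit boundaries are split.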

Publisher

PeerJ

Link

https://peerj.com/articles/cs-1704.pdf

References (38 articles)

1. Reduplication in English and Urdu; Afraz; PhD thesis, 2012

2. Word segmentation for Urdu OCR system; Akram, 2010

3. Urdu news article recommendation model using natural language processing techniques; Abbas, 2022

4. Text summarization techniques: a brief survey; Allahyari, 2017

5. Vard2: a tool for dealing with spelling variation in historical corpora; Baron, 2008

Cited by 1 article.

1. Leveraging Hybrid Adaptive Sine Cosine Algorithm with Deep Learning for Arabic Poem Meter Detection; ACM Transactions on Asian and Low-Resource Language Information Processing; 2024-07-10