Using Data Augmentation and Bidirectional Encoder Representations from Transformers for Improving Punjabi Named Entity Recognition

Published:2023-06-16 Issue:6 Volume:22 Page:1-13
ISSN:2375-4699
Container-title:ACM Transactions on Asian and Low-Resource Language Information Processing
language:en
Short-container-title:ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Khalid Hamza¹, Murtaza Ghulam¹, Abbas Qaiser¹

Affiliation:

1. Department of Computer Science, University of Engineering and Technology Lahore, Punjab, Pakistan

Abstract

Named entity recognition (NER) is the task of identifying proper nouns in natural language text and classifying them into types such as location, person, and organization. Because NER supports many natural language processing (NLP) tasks, numerous NER approaches and benchmark datasets have been proposed. However, the development of NER techniques for low-resource languages remains limited by the absence of substantial training datasets. Punjabi is a classic example of a low-resource language. Although various researchers have worked on Punjabi, they focused on the Gurmukhi script. To overcome the challenges in developing NER for the Shahmukhi script, we present an improved technique for Punjabi NER in Shahmukhi. We first extend the existing dataset with new NER classes, leveraging a novel Pool of Words data augmentation strategy. Our extended dataset has 1,131,509 tokens and 125,789 labeled entities, with more named entities (NEs) than the older dataset. Next, we fine-tune a transformer model, Bidirectional Encoder Representations from Transformers (BERT), for the NER task. We performed experiments with the proposed approach on both the new and the older versions of the dataset, showing that our method achieved competitive results.
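
The abstract names a Pool of Words augmentation strategy but does not describe its procedure. As a rough illustration only, the sketch below shows a generic mention-replacement augmentation for BIO-tagged NER data: each entity span is swapped for a random entry of the same type drawn from a pool. This is an assumption about what such a strategy might look like, not the authors' method. The function name `augment`, the pool layout, and the English placeholder tokens are all hypothetical.

```python
import random


def augment(tokens, tags, pool, rng):
    """Swap each entity span for a random same-type pool entry.

    Hypothetical sketch: `pool` maps an entity type (e.g. "PER") to a list
    of replacement mentions, each a whitespace-separated string.
    """
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            # Find the end of the span: B-X followed by consecutive I-X tags.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1
            candidates = pool.get(etype)
            if candidates:
                replacement = rng.choice(candidates).split()
            else:
                # No pool for this type: keep the original mention.
                replacement = tokens[i:j]
            out_tokens.extend(replacement)
            # Re-tag the new mention so the labels match its length.
            out_tags.extend(["B-" + etype] + ["I-" + etype] * (len(replacement) - 1))
            i = j
        else:
            # Non-entity tokens (and stray I- tags) are copied unchanged.
            out_tokens.append(tokens[i])
            out_tags.append(tag)
            i += 1
    return out_tokens, out_tags


# Placeholder English data, not taken from the paper's dataset.
sentence = ["Ali", "lives", "in", "Lahore"]
labels = ["B-PER", "O", "O", "B-LOC"]
word_pool = {"PER": ["Sara Khan"], "LOC": ["Multan"]}

new_tokens, new_labels = augment(sentence, labels, word_pool, random.Random(0))
print(list(zip(new_tokens, new_labels)))
```

Re-tagging the replacement keeps tokens and labels aligned even when the new mention has a different length from the original, which is the main thing a sequence-labeling augmentation has to preserve.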

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3595861

Reference: 37 articles.

1. Named Entity Recognition in Natural Language Processing: A Systematic Review

2. Long short-term memory RNN for biomedical named entity recognition

3. Named entity recognition using support vector machine: A language independent approach;Ekbal A.;International Journal of Electrical and Computer Engineering,2010

4. Neural machine translation for low-resource languages: A survey;Ranathunga S.;arXiv preprint,2021

5. Low-Resource Named Entity Recognition via the Pre-Training Model

Cited by 1 article.

1. Advancing NLP for Punjabi Language: A Comprehensive Review of Language Processing Challenges and Opportunities;2024 2nd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT);2024-01-04