NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic

Author:

OUDAH MAI, SHAALAN KHALED

Abstract

Named Entity Recognition (NER) is an essential task for many natural language processing systems and draws on various linguistic resources. NER becomes more complicated when the language in use is morphologically rich and structurally complex, such as Arabic, which has a set of characteristics that make it particularly challenging to handle. In previous work, we proposed an Arabic NER system that follows the hybrid approach, i.e. it integrates both rule-based and machine learning-based NER approaches. Our hybrid NER system is the state of the art in Arabic NER according to its performance on standard evaluation datasets. In this article, we discuss a novel methodology for overcoming the coverage drawback of rule-based NER systems in order to improve their performance and allow for automated rule update. The presented mechanism uses the recognition decisions made by the hybrid NER system to identify the weaknesses of the rule-based component and to derive new linguistic rules that enhance the rule base, which helps achieve more reliable and accurate results. We used the ACE 2004 Newswire (NW) standard dataset as a resource for extracting and analyzing new linguistic rules for recognizing person, location and organization names. We formulate each new rule based on two distinctive feature groups: gazetteers for each type of named entity, and part-of-speech (POS) tags, in particular noun and proper noun. Fourteen new patterns are derived, formulated as grammar rules, and evaluated in terms of coverage. The experiments use a POS-tagged version of the ACE 2004 NW dataset. The empirical results show that the enhanced rule-based system, i.e. NERA 2.0, improves the coverage of the previously misclassified person, location and organization named entity types by 69.93 per cent, 57.09 per cent and 54.28 per cent, respectively.
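As a rough illustration of the two ideas in the abstract, the sketch below (1) finds entities that a hybrid system recognized but a rule-based component missed, and (2) applies one rule that combines a gazetteer lookup with POS tags. Everything named here is an assumption made for illustration: the transliterated trigger words, the `PROPN` tag label, the span representation, and both function names. None of it is taken from NERA 2.0, whose fourteen patterns are not listed in this abstract.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]   # (word, POS tag); tag names are hypothetical
Span = Tuple[int, int]    # (start, end) token offsets, end exclusive

# Hypothetical mini-gazetteer of person trigger words (transliterated).
PERSON_TRIGGERS = {"al-ra'is", "al-duktur"}


def missed_by_rules(hybrid_spans: Set[Span], rule_spans: Set[Span]) -> List[Span]:
    """Spans the hybrid system recognized that the rule-based component did not.

    These are the candidates one would inspect when looking for new rules.
    """
    return sorted(hybrid_spans - rule_spans)


def match_person_rule(tokens: List[Token]) -> List[Span]:
    """Illustrative rule: a gazetteer trigger word followed by one or more
    proper nouns marks the proper-noun run as a person name."""
    spans: List[Span] = []
    i = 0
    while i < len(tokens):
        word, _tag = tokens[i]
        if word in PERSON_TRIGGERS:
            j = i + 1
            # Extend over the run of proper nouns after the trigger.
            while j < len(tokens) and tokens[j][1] == "PROPN":
                j += 1
            if j > i + 1:
                spans.append((i + 1, j))
                i = j
                continue
        i += 1
    return spans


if __name__ == "__main__":
    sentence = [
        ("qala", "VERB"),
        ("al-ra'is", "NOUN"),
        ("Mahmoud", "PROPN"),
        ("Abbas", "PROPN"),
        ("ams", "NOUN"),
    ]
    print(match_person_rule(sentence))
    print(missed_by_rules({(2, 4), (7, 8)}, {(2, 4)}))
```

A real rule base would be written in a grammar formalism and would cover location and organization names as well; the set difference above is only the simplest possible way to compare two systems' decisions.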

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software

Cited by 25 articles.

1. Named Entity Recognition of Tunisian Arabic Using the Bi-LSTM-CRF Model;International Journal on Artificial Intelligence Tools;2023-09-28

2. Chinese Named Entity Recognition in Football Based on ALBERT-BiLSTM Model;Applied Sciences;2023-09-28

3. Evaluation on Network Social Media Named Entity Recognition Model Based on Active Learning;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-09-05

4. Comparing Open Arabic Named Entity Recognition Tools;2023 IEEE 24th International Conference on Information Reuse and Integration for Data Science (IRI);2023-08

5. Challenges and Solutions for Arabic Natural Language Processing in Social Media;Business Intelligence and Information Technology;2023
