Evaluation of Topic Identification Methods for Arabic Texts and their Combination by using a Corpus Extracted from the Omani Newspaper Alwatan

Authors:

Abbas Mourad1, Smaili Kamel2, Berkani Daoud3

Affiliations:

1. The Center for Scientific and Technical Research for the Promotion of the Arabic Language, Algeria

2. Lorraine Research Laboratory in Computer Science and its Applications (LORIA), France

3. Higher Polytechnic School, Algeria

Abstract

Topic identification is used in several applications, such as adapting language models for speech recognition and machine translation, or focusing search engines on a specific use. It consists of assigning one or several topic labels to a stream of textual data, with labels chosen from a set of topics fixed a priori. In this paper, we present a study on identifying the topics of Arabic texts. This requires a considerable amount of data, so we began by collecting texts from the website of the Omani newspaper “Alwatan”. The result is an Arabic corpus of more than 9000 articles, corresponding to nearly 10 million words. The topics considered in our experiments are Culture, Religion, Economy, Local news, International news and Sports. Some of the methods presented in this study are well known in the text categorization community, such as the TFIDF classifier and kNN (k Nearest Neighbors). We use these methods as points of comparison for the TR-classifier (TRiggers-based classifier), a new method that we have proposed, which is based on computing triggers, i.e. the Average Mutual Information of each pair of words. To improve performance, we combined the results of the three methods using three approaches: Majority Vote, Enhanced Majority Vote and Linear Combination.
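As an illustration of two of the combination approaches named in the abstract, the following is a minimal Python sketch of a majority vote and a linear combination over the outputs of three classifiers. It is not the authors' implementation: the function names, the tie-breaking rule (fall back to one designated classifier), the assumption that each classifier returns normalized per-topic scores, and the weights used in the example are all hypothetical. Enhanced Majority Vote is not sketched because the abstract does not define it.

```python
from collections import Counter

# Topic set taken from the abstract.
TOPICS = ["Culture", "Religion", "Economy",
          "Local news", "International news", "Sports"]


def majority_vote(predictions, fallback_index=0):
    """Return the label predicted by the most classifiers.

    `predictions` is a list with one topic label per classifier.
    Assumption: on a tie, return the label of the classifier at
    `fallback_index` (the source does not specify a tie-breaking rule).
    """
    counts = Counter(predictions)
    top_count = max(counts.values())
    tied = [label for label, count in counts.items() if count == top_count]
    if len(tied) > 1:
        return predictions[fallback_index]
    return tied[0]


def linear_combination(score_dicts, weights):
    """Return the topic with the highest weighted sum of scores.

    `score_dicts` holds one {topic: score} dict per classifier; scores
    are assumed to be comparable across classifiers (e.g. normalized).
    `weights` holds one hypothetical weight per classifier.
    """
    combined = {
        topic: sum(w * scores.get(topic, 0.0)
                   for w, scores in zip(weights, score_dicts))
        for topic in TOPICS
    }
    return max(combined, key=combined.get)


if __name__ == "__main__":
    # Labels from three classifiers (TFIDF, kNN, TR-classifier), made up
    # for illustration only.
    print(majority_vote(["Sports", "Sports", "Economy"]))

    # Per-topic scores from the same three classifiers, also made up.
    tfidf = {"Sports": 0.6, "Economy": 0.4}
    knn = {"Sports": 0.3, "Economy": 0.7}
    tr = {"Sports": 0.2, "Economy": 0.8}
    print(linear_combination([tfidf, knn, tr], [0.2, 0.3, 0.5]))
```

In practice the weights of a linear combination would be estimated on held-out data rather than fixed by hand as they are here.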

Publisher

Emerald

Subject

Water Science and Technology; Agronomy and Crop Science; Ecology, Evolution, Behavior and Systematics


Cited by 1 article.
