Uyghur–Kazakh–Kirghiz Text Keyword Extraction Based on Morpheme Segmentation

Author:

Parhat Sardar 1, Sattar Mutallip 1, Hamdulla Askar 2, Kadir Abdurahman 1

Affiliation:

1. College of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, China

2. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China

Abstract

In this study, we investigate a text keyword extraction method, built on a morpheme segmentation framework, for Uyghur, Kazakh and Kirghiz, three languages with similar grammatical and lexical structures. In these languages, a word is formed by joining affixes to a stem. The stem carries the notional meaning, while the affixes perform grammatical functions. Because of this derivational morphology, the vocabularies of these languages are very large, so pre-processing is a necessary step in NLP tasks for Uyghur, Kazakh and Kirghiz. Morpheme segmentation lets us remove the suffixes, which act as auxiliary units, while retaining the meaningful stems, and this reduces the dimension of the feature space in the keyword extraction task. We cast morpheme segmentation as the problem of labeling morpheme sequences: a Bi-LSTM network captures the positional features of the character sequence in both directions, and a CRF layer learns the dependencies between preceding and following labels. The resulting Bi-LSTM-CRF model segments morphemes with high accuracy, and we used it to prepare morpheme-based experimental text sets. We then trained stem embedding vectors with the Doc2vec algorithm, used the similarity between stem vectors to modify the TextRank algorithm, and ran text keyword extraction experiments. The highest F1 scores obtained on the three datasets were 43.8%, 44% and 43.9%. The results show that the morpheme-based approach performs much better than the word-based approach, which indicates that stem vector similarity weighting is an effective method for text keyword extraction and that morpheme sequences are well suited to morphologically derivative languages.
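The idea of modifying TextRank with stem vector similarity can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so everything below is an assumption made for illustration: the function name `weighted_textrank`, the co-occurrence window, the rescaling of cosine similarity to [0, 1], and the normalized damping term are hypothetical choices, and the stem vectors are assumed to come from a separately trained embedding model such as Doc2vec.

```python
import numpy as np

def weighted_textrank(stems, vectors, window=2, damping=0.85, iters=50):
    """Rank stems with TextRank, weighting co-occurrence edges by
    the cosine similarity of their embedding vectors (illustrative only)."""
    vocab = sorted(set(stems))
    index = {s: i for i, s in enumerate(vocab)}
    n = len(vocab)
    weights = np.zeros((n, n))

    def cosine(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float((a @ b) / denom) if denom else 0.0

    # Connect stems that co-occur within the window; each co-occurrence
    # adds an edge weight based on embedding similarity.
    for i, s in enumerate(stems):
        for t in stems[i + 1 : i + 1 + window]:
            if s == t:
                continue
            # Assumption: map cosine from [-1, 1] to [0, 1] so weights stay non-negative.
            w = (1.0 + cosine(vectors[s], vectors[t])) / 2.0
            weights[index[s], index[t]] += w
            weights[index[t], index[s]] += w

    # Row-normalize so each stem distributes its score over its neighbours.
    out_sum = weights.sum(axis=1)
    out_sum[out_sum == 0] = 1.0
    transition = weights / out_sum[:, None]

    # Power iteration of the weighted TextRank update.
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * (transition.T @ scores)

    return sorted(zip(vocab, scores), key=lambda pair: -pair[1])

# Toy usage with made-up stems and vectors (not data from the paper).
toy_stems = ["kitab", "oqu", "kitab", "mektep", "kitab", "bala"]
toy_vectors = {s: [1.0, 0.5] for s in set(toy_stems)}
ranking = weighted_textrank(toy_stems, toy_vectors, window=1)
print(ranking[:3])
```

The top-scoring stems would be taken as keyword candidates. Running the same procedure on unsegmented word forms instead of stems gives the word-based baseline that the abstract compares against.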

Funder

National Natural Science Foundation of China

Xinjiang University of Finance and Economics School Level Scientific Research Foundation Project

Publisher

MDPI AG

Subject

Information Systems

