Learned Text Representation for Amharic Information Retrieval and Natural Language Processing

Author:

Yeshambel Tilahun 1, Mothe Josiane 2, Assabie Yaregal 3

Affiliation:

1. IT Doctorial Program, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia

2. Composante INSPE, IRIT, UMR5505 CNRS, Université de Toulouse Jean-Jaurès, 118 Rte de Narbonne, F31400 Toulouse, France

3. Department of Computer Science, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia

Abstract

Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have improved the learning of text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, which has led to the development of neural network language models for various languages. This is not the case for Amharic, a morphologically complex and under-resourced language for which usable pre-trained models for automatic text processing are not available. This paper investigates learned text representations for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used word embedding methods, namely word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings. We also analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections that contain word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embeddings on the word-based corpus.
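As an illustration of the query expansion idea mentioned in the abstract, the following minimal sketch adds to each query term its nearest neighbors in an embedding space, ranked by cosine similarity. It is not the authors' implementation: the `EMBEDDINGS` table, its English placeholder words, the vector values, and the `expand_query` function are all hypothetical and chosen only to keep the example self-contained. In practice the vectors would come from a word2vec, GloVe, or fastText model trained on a word-based, stem-based, or root-based Amharic corpus.

```python
import math

# Hypothetical toy vectors (assumption, not data from the paper).
# A real system would load vectors from a trained embedding model.
EMBEDDINGS = {
    "river": [0.9, 0.1, 0.0],
    "stream": [0.8, 0.2, 0.1],
    "water": [0.7, 0.3, 0.0],
    "bank": [0.1, 0.9, 0.2],
    "money": [0.0, 0.8, 0.6],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, k=2):
    """Return the original terms followed by up to k neighbors per term.

    Terms without a vector are kept but not expanded.
    """
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue
        # Score every other vocabulary word not already in the query.
        scored = sorted(
            (
                (cosine(embeddings[term], vec), word)
                for word, vec in embeddings.items()
                if word != term and word not in expanded
            ),
            reverse=True,
        )
        expanded.extend(word for _, word in scored[:k])
    return expanded


if __name__ == "__main__":
    print(expand_query(["river"], EMBEDDINGS, k=2))
```

The expanded term list would then be passed to the retrieval system in place of the original query. The choice of `k` and whether expansion terms are weighted lower than the original terms are design decisions this sketch leaves open.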

Publisher

MDPI AG

Subject

Information Systems

Cited by 4 articles.

1. Research on a Mongolian Text to Speech Model Based on Ghost and ILPCnet;Applied Sciences;2024-01-11

2. Recommendation Algorithm Based on Heterogeneous Information Network and Attention Mechanism;Applied Sciences;2023-12-30

3. Development of an NLP-Based Automatic Data Retrieval Model;2023 4th International Conference on Smart Electronics and Communication (ICOSEC);2023-09-20

4. A Deep Unsupervised Representation Learning Architecture using Coefficient of Variation in Hidden Layers;2023 IEEE 9th International Conference on Cloud Computing and Intelligent Systems (CCIS);2023-08-12
