Learned Text Representation for Amharic Information Retrieval and Natural Language Processing

Published:2023-03-20 Issue:3 Volume:14 Page:195
ISSN:2078-2489
Container-title:Information
language:en
Short-container-title:Information

Author:

Yeshambel Tilahun¹, Mothe Josiane², Assabie Yaregal³

Affiliation:

1. IT Doctoral Program, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia

2. Composante INSPE, IRIT, UMR5505 CNRS, Université de Toulouse Jean-Jaurès, 118 Rte de Narbonne, F31400 Toulouse, France

3. Department of Computer Science, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia

Abstract

Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language. Usable pre-trained models for automatic Amharic text processing are not available. This paper presents an investigation of learned text representations for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used methods for word embeddings, including word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings. We also analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections that contain word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embeddings on the word-based corpus.

Publisher

MDPI AG

Subject

Information Systems

Link

https://www.mdpi.com/2078-2489/14/3/195/pdf

References: 56 articles.

1. Liu, Z., Lin, Y., and Sun, M. (2020). Representation Learning for Natural Language Processing, Springer. Available online: https://link.springer.com/book/10.1007/978-981-15-5573-2.

2. Manning, C., Raghavan, P., and Schütze, H. (2010). Introduction to Information Retrieval, Cambridge University Press. Available online: https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf.

3. Sebastiani (2002). Machine learning in automated text categorization. ACM Comput. Surv.

4. Tellex, S., Katz, B., Lin, J., Fernandes, A., and Marton, G. (2003, July 28–August 1). Quantitative evaluation of passage retrieval algorithms for question answering. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, ON, Canada.

5. Turian, J., Ratinov, L., and Bengio, Y. (2010, July 11–16). Word representations: A simple and general method for semi-supervised learning. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden.

Cited by 4 articles.

1. Research on a Mongolian Text to Speech Model Based on Ghost and ILPCnet;Applied Sciences;2024-01-11

2. Recommendation Algorithm Based on Heterogeneous Information Network and Attention Mechanism;Applied Sciences;2023-12-30

3. Development of an NLP-Based Automatic Data Retrieval Model;2023 4th International Conference on Smart Electronics and Communication (ICOSEC);2023-09-20

4. A Deep Unsupervised Representation Learning Architecture using Coefficient of Variation in Hidden Layers;2023 IEEE 9th International Conference on Cloud Computing and Intelligent Systems (CCIS);2023-08-12