Affiliation:
1. Complex Systems Modeling Laboratory, Cadi Ayyad University, Marrakesh 40000, Morocco
2. Independent Researcher, 13120 Gardanne, France
Abstract
The efficiency of information retrieval systems depends primarily on how well documents are represented during query processing. This representation is built mainly from the relevant terms that are identified and selected during indexing and then used for retrieval. However, when documents contain only a few features, as short documents do, the resulting representation may be information-poor because the index terms are few and weakly relevant. Although document representation can be enriched using techniques such as word embeddings, these techniques rely on large training corpora, which are often unavailable for domain-specific short documents. This study investigates a new approach that uses generative AI to enrich document representation during indexing. In the proposed approach, the relevant terms extracted from documents and preprocessed for indexing are enriched with a list of key terms suggested by a large language model (LLM). After a small benchmark of several well-known LLMs on key term suggestion for a set of short texts, GPT-4o was chosen for the experiments with the proposed indexing approach. The results indicate that generative AI can efficiently fill the knowledge gap in document representation, regardless of the retrieval technique used.
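The enrichment step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the stopword list, and the function names (`extract_terms`, `enrich_terms`, `build_inverted_index`) are assumptions made for illustration, and `fake_llm_suggest` is a hypothetical stand-in that returns canned suggestions where a real system would query an LLM such as GPT-4o.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set

# Tiny illustrative stopword list (an assumption, not taken from the study).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def extract_terms(text: str) -> List[str]:
    """Lowercase, tokenize on alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def enrich_terms(text: str, suggest: Callable[[str], Iterable[str]]) -> List[str]:
    """Return the document's own terms, then suggested terms not already present."""
    terms = extract_terms(text)
    seen = set(terms)
    for phrase in suggest(text):
        # Suggested phrases go through the same preprocessing as document text.
        for term in extract_terms(phrase):
            if term not in seen:
                seen.add(term)
                terms.append(term)
    return terms


def build_inverted_index(
    docs: Dict[str, str], suggest: Callable[[str], Iterable[str]]
) -> Dict[str, Set[str]]:
    """Map each enriched index term to the set of documents it represents."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in enrich_terms(text, suggest):
            index[term].add(doc_id)
    return dict(index)


def fake_llm_suggest(text: str) -> List[str]:
    """Hypothetical stand-in for an LLM call; returns canned key terms."""
    canned = {"battery": ["energy storage", "lithium-ion"]}
    lowered = text.lower()
    return [s for key, values in canned.items() if key in lowered for s in values]


docs = {
    "d1": "Battery swelling in the phone",
    "d2": "Screen flickers on startup",
}
index = build_inverted_index(docs, fake_llm_suggest)
```

In this sketch, a query for "energy" would match `d1` even though the word never appears in that document, which is the kind of gap the enrichment is meant to close. Passing the suggestion function as a parameter keeps the indexing logic independent of any particular model or API.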