Abstract
Word embeddings enhance pseudo-relevance feedback query expansion (PRFQE), but training word embedding models is time-consuming and typically requires large datasets. Moreover, training embedding models requires special processing for languages with rich vocabularies and complex morphological structures, such as Arabic. This paper proposes training such models on a representative subset of a dataset and defines the conditions of representativeness. Using a suitable subset of words to train a word embedding model is effective because it dramatically decreases the training time while preserving retrieval effectiveness. This paper shows that the subset of words carrying the prefix ‘AL,’ the AL-Definite words, represents the TREC 2001/2002 dataset; for example, training the SkipGram word embedding model on the AL-Definite words of this dataset takes only 10% of the time required by the whole dataset. The trained models are used to embed words for different Arabic query expansion scenarios, and the proposed training method proves effective: it outperforms ordinary PRFQE by at least 7% in Mean Average Precision (MAP) and 14.5% in precision at the 10th returned document (P10). Moreover, the improvement over retrieval without query expansion is 21.7% in MAP and 21.32% in P10. The results show no significant differences between the word embedding models used for Arabic query expansion.
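As a rough illustration of the idea summarized above (a sketch, not the authors' actual pipeline), the snippet below filters a tokenized Arabic corpus down to the AL-Definite words, i.e. tokens beginning with the definite article ال (AL), trains a gensim SkipGram Word2Vec model on that reduced subset, and uses the nearest neighbours of query terms as expansion candidates. The corpus variable, parameter values, and the expand_query helper are hypothetical; in the paper the embeddings are combined with pseudo-relevance feedback rather than used for standalone expansion.

```python
# Minimal sketch: train SkipGram on the AL-Definite subset of a corpus.
# Assumptions: gensim is installed; `corpus` is a list of tokenized Arabic
# documents; all parameter values are illustrative only.
from gensim.models import Word2Vec

AL_PREFIX = "\u0627\u0644"  # the Arabic definite article "al-" (alef + lam)

def al_definite_subset(corpus):
    """Keep only the AL-Definite tokens of each document."""
    return [[tok for tok in doc if tok.startswith(AL_PREFIX)] for doc in corpus]

# Toy corpus (hypothetical); in practice this would be the full collection.
corpus = [["الكتاب", "مفيد", "الطالب"], ["العلم", "نور", "المعرفة"]]

# Train SkipGram (sg=1) on the reduced corpus instead of the whole dataset.
model = Word2Vec(
    sentences=al_definite_subset(corpus),
    vector_size=100,   # embedding dimension (illustrative)
    window=5,
    min_count=1,
    sg=1,              # 1 = SkipGram architecture
    workers=4,
)

def expand_query(query_terms, topn=3):
    """Hypothetical embedding-based expansion: append nearest neighbours
    of each query term that exists in the trained vocabulary."""
    expansion = []
    for term in query_terms:
        if term in model.wv:
            expansion.extend(w for w, _ in model.wv.most_similar(term, topn=topn))
    return query_terms + expansion
```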
Publisher
Research Square Platform LLC