TAG term weight-based N gram Thesaurus generation for query expansion in information retrieval application

Published:2015-04-27 Issue:4 Volume:41 Page:467-485
ISSN:0165-5515
Container-title:Journal of Information Science
language:en
Short-container-title:Journal of Information Science

Author:

Shaila S.G.¹, Vadivel A.¹

Affiliation:

1. Information Retrieval Group, Department of Computer Applications, National Institute of Technology, Tamil Nadu, India

Abstract

Query expansion is an important task in information retrieval applications: it improves the user query and helps in retrieving relevant documents. In this paper, an N gram Thesaurus is constructed from the documents for query expansion. The HTML TAGs in web documents are considered and their syntactical context is analysed. Based on their nature, properties and significance, the TAGs are assigned suitable weights. The term weight is then calculated from the corresponding TAG weight and the term frequency, and is updated into the inverted index. All the single terms in the inverted index are added as Unigrams in the Thesaurus. Bigrams are then constructed from the Unigrams. Likewise, the remaining (N + 1) grams are generated from the N grams and their weights, and are added to the Thesaurus. During the query session, the user query terms are expanded based on the predicted N grams provided by the Thesaurus, which are offered as suggestions to the user. The performance of the proposed approach is evaluated using the ClueWeb09B, WT10g and GOV2 benchmark datasets. The improvement gain against the baseline is used as the evaluation parameter, and the proposed approach achieved a 7.9% gain on ClueWeb09B, 18.3% on WT10g and 29.4% on GOV2 in terms of Mean Average Precision (MAP). We also compared the performance of the proposed approach with two other query expansion approaches, KLDCo and BoCo. The approach achieved 0.574 (+0.236), 0.519 (+0.209), 0.422 (+0.185) and 0.654 (+0.243) in terms of P@5, P@10, MAP and MRR, respectively, against the baselines.
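The pipeline outlined in the abstract (TAG-weighted terms, then Unigrams, then higher-order N grams, then query suggestions) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives neither the TAG weight values nor the exact weighting formulas, so everything specific here is an assumption made for illustration: the `TAG_WEIGHTS` table, the rule that a term occurrence takes the largest weight among its enclosing TAGs, the rule that an N gram's weight is the sum of its constituent occurrence weights, and the helper names `tokenize`, `build_thesaurus` and `suggest`. For brevity, N grams are also formed across TAG boundaries.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical TAG weights; the abstract does not list the real values.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "b": 2.0, "p": 1.0}
DEFAULT_WEIGHT = 1.0


class TagTermParser(HTMLParser):
    """Collects (term, weight) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.tokens = []  # (term, weight) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumption: a term takes the largest weight among enclosing tags.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.tokens.append((term, weight))


def tokenize(html):
    parser = TagTermParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


def build_thesaurus(docs, max_n=3):
    """Maps n -> {n-gram tuple: accumulated weight} for n = 1..max_n."""
    thesaurus = {n: defaultdict(float) for n in range(1, max_n + 1)}
    for html in docs:
        tokens = tokenize(html)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                window = tokens[i:i + n]
                gram = tuple(term for term, _ in window)
                # Assumption: n-gram weight = sum of occurrence weights.
                thesaurus[n][gram] += sum(w for _, w in window)
    return thesaurus


def suggest(thesaurus, query, k=3):
    """Returns up to k (N + 1) grams that extend the N-term query."""
    prefix = tuple(query.lower().split())
    candidates = thesaurus.get(len(prefix) + 1, {})
    matches = [(g, w) for g, w in candidates.items()
               if g[:len(prefix)] == prefix]
    matches.sort(key=lambda gw: (-gw[1], gw[0]))
    return [" ".join(g) for g, _ in matches[:k]]


docs = [
    "<h1>query expansion</h1><p>query expansion improves retrieval</p>",
    "<p>query logs help</p>",
]
thesaurus = build_thesaurus(docs)
print(suggest(thesaurus, "query"))
```

In this toy corpus the Bigram "query expansion" outranks "query logs" because one of its occurrences sits inside an `h1` TAG, which is the kind of effect TAG-based weighting is meant to produce.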

Publisher

SAGE Publications

Subject

Library and Information Sciences, Information Systems

Link

http://journals.sagepub.com/doi/pdf/10.1177/0165551515581567

Cited by 4 articles.

1. Research and Implementation of Automatic Indexing Method of PDF for Digital Publishing;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-03-31

2. Domain-Specific Term Extraction: A Case Study on Greek Maritime Legal Texts;Proceedings of the 12th Hellenic Conference on Artificial Intelligence;2022-09-07

3. Contextual weighting approach to compute term weight in layered vector space model;Journal of Information Science;2019-07-29

4. RENT: Regular Expression and NLP-Based Term Extraction Scheme for Agricultural Domain;Proceedings of the International Conference on Data Engineering and Communication Technology;2016-08-24