TAG term weight-based N gram Thesaurus generation for query expansion in information retrieval application

Author:

Shaila S.G.1, Vadivel A.1

Affiliation:

1. Information Retrieval Group, Department of Computer Applications, National Institute of Technology, Tamil Nadu, India

Abstract

Query expansion is an important task in information retrieval applications: it refines the user query and helps retrieve the relevant documents. In this paper, an N gram Thesaurus is constructed from the documents for query expansion. The HTML TAGs in web documents are considered and their syntactic context is analysed. Based on their nature, properties and significance, the TAGs are assigned suitable weights. The term weight is then calculated from the corresponding TAG weight and the term frequency, and is updated in the inverted index. All single terms in the inverted index are added to the Thesaurus as Unigrams. Bigrams are then constructed from the Unigrams. Likewise, the (N + 1) grams are generated from the N grams and their weights, and are added to the Thesaurus. During the query session, the user's query terms are expanded using the N grams predicted by the Thesaurus, which are offered to the user as suggestions. The performance of the proposed approach is evaluated on the ClueWeb09B, WT10g and GOV2 benchmark datasets. The improvement gain over the baseline is used as the evaluation parameter, and the proposed approach achieved gains of 7.9% on ClueWeb09B, 18.3% on WT10g and 29.4% on GOV2 in terms of Mean Average Precision (MAP). We also compared the performance of the proposed approach with two other query expansion approaches, KLDCo and BoCo. The approach achieved 0.574 (+0.236), 0.519 (+0.209), 0.422 (+0.185) and 0.654 (+0.243) in terms of P@5, P@10, MAP and MRR, respectively, where the values in parentheses are the gains against the baselines.
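The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the TAG weight values nor the exact weighting formula, so the `TAG_WEIGHTS` table, the rule "weight = TAG weight summed over occurrences", and the function names `build_inverted_index`, `build_thesaurus` and `suggest` are all assumptions made for illustration. Documents are assumed to be already parsed into (tag, text) segments.

```python
from collections import defaultdict

# Hypothetical TAG weights (assumption): the abstract does not list the paper's values.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "b": 2.0, "p": 1.0}


def build_inverted_index(docs):
    """Map term -> {doc_id: weight}, where each occurrence adds its TAG weight.

    Summing the TAG weight per occurrence is an assumed stand-in for
    "TAG weight combined with term frequency".
    """
    index = defaultdict(lambda: defaultdict(float))
    for doc_id, segments in docs.items():
        for tag, text in segments:
            tag_weight = TAG_WEIGHTS.get(tag, 1.0)
            for term in text.lower().split():
                index[term][doc_id] += tag_weight
    return index


def build_thesaurus(docs, max_n=3):
    """Return {n: {ngram_tuple: weight}} for n = 1..max_n.

    N grams are taken from adjacent terms inside one tagged segment and
    weighted the same way as single terms (an assumption).
    """
    thesaurus = {n: defaultdict(float) for n in range(1, max_n + 1)}
    for segments in docs.values():
        for tag, text in segments:
            tag_weight = TAG_WEIGHTS.get(tag, 1.0)
            terms = text.lower().split()
            for n in range(1, max_n + 1):
                for i in range(len(terms) - n + 1):
                    thesaurus[n][tuple(terms[i:i + n])] += tag_weight
    return thesaurus


def suggest(thesaurus, query, k=3):
    """Suggest up to k (N + 1) grams that extend the N-term query, heaviest first."""
    terms = tuple(query.lower().split())
    candidates = thesaurus.get(len(terms) + 1, {})
    matches = [(gram, w) for gram, w in candidates.items() if gram[:-1] == terms]
    # Sort by descending weight, then alphabetically for a stable order.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [" ".join(gram) for gram, _ in matches[:k]]


if __name__ == "__main__":
    sample_docs = {
        "d1": [("title", "query expansion methods"),
               ("p", "query expansion improves retrieval")],
        "d2": [("p", "query logs help query expansion")],
    }
    thesaurus = build_thesaurus(sample_docs)
    print(suggest(thesaurus, "query"))
    print(suggest(thesaurus, "query expansion"))
```

In this sketch a term that appears in a heavily weighted TAG such as the title outranks the same term in body text, so suggestions for a one-term query are the Bigrams with the largest accumulated weight, and suggestions for a two-term query are the matching Trigrams.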

Publisher

SAGE Publications

Subject

Library and Information Sciences, Information Systems

Cited by 4 articles.

1. Research and Implementation of Automatic Indexing Method of PDF for Digital Publishing;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-03-31

2. Domain-Specific Term Extraction: A Case Study on Greek Maritime Legal Texts;Proceedings of the 12th Hellenic Conference on Artificial Intelligence;2022-09-07

3. Contextual weighting approach to compute term weight in layered vector space model;Journal of Information Science;2019-07-29

4. RENT: Regular Expression and NLP-Based Term Extraction Scheme for Agricultural Domain;Proceedings of the International Conference on Data Engineering and Communication Technology;2016-08-24
