To Clean or Not to Clean

Published:2018-12-31 Issue:4 Volume:10 Page:1-25
ISSN:1936-1955
Container-title:Journal of Data and Information Quality
language:en
Short-container-title:J. Data and Information Quality

Author:

Roy Dwaipayan¹, Mitra Mandar¹, Ganguly Debasis²

Affiliation:

1. Indian Statistical Institute, Kolkata, India

2. IBM Research, Dublin, Ireland

Abstract

Web document collections such as WT10G, GOV2, and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content-related markup in the form of tags, hyperlinks, and so on. Published articles that use these corpora generally do not provide specific details about how this markup information is handled during indexing. However, this question turns out to be important: Through experiments, we find that including or excluding metadata in the index can produce significantly different results with standard IR models. More importantly, the effect varies across models and collections. For example, metadata filtering is found to be generally beneficial when using BM25, or language modeling with Dirichlet smoothing, but can significantly reduce retrieval effectiveness if language modeling is used with Jelinek-Mercer smoothing. We also observe that, in general, the performance differences become more noticeable as the amount of metadata in the test collections increases. Given this variability, we believe that the details of document preprocessing are significant from the point of view of reproducibility. In a second set of experiments, we also study the effect of preprocessing on query expansion using RM3. In this case, once again, we find that it is generally better to remove markup before using documents for query expansion.
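
The preprocessing choice described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline and does not reproduce their experiments; it only shows that indexing markup tokens changes document length, which both Jelinek-Mercer and Dirichlet smoothing depend on. The names `strip_markup`, `tokenize`, `jm_prob`, and `dirichlet_prob`, the sample document, and the values of `lam`, `mu`, and `coll_prob` are all assumptions made for illustration. Here `lam` is taken as the weight on the collection model, and the collection probability is held fixed for simplicity, although in a real index it would also change when markup is kept.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes only; tags and attributes are discarded.

    Simplification: the contents of <script> and <style> are not filtered.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(raw):
    """Return the visible text of an HTML fragment, whitespace-normalized."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jm_prob(term, doc_tokens, coll_prob, lam=0.5):
    """Jelinek-Mercer smoothing: fixed-weight mix of document and collection models."""
    tf = Counter(doc_tokens)[term]
    return (1 - lam) * tf / len(doc_tokens) + lam * coll_prob


def dirichlet_prob(term, doc_tokens, coll_prob, mu=1000):
    """Dirichlet smoothing: the collection model acts as mu pseudo-counts."""
    tf = Counter(doc_tokens)[term]
    return (tf + mu * coll_prob) / (len(doc_tokens) + mu)


# Hypothetical sample document, not taken from any test collection.
raw = ('<html><body><a href="http://example.org/ir">text retrieval</a> '
       '<p class="note">retrieval models</p></body></html>')

with_markup = tokenize(raw)               # tag and attribute tokens are indexed
cleaned = tokenize(strip_markup(raw))     # only visible text is indexed

for name, tokens in (("with markup", with_markup), ("cleaned", cleaned)):
    print(name, len(tokens),
          jm_prob("retrieval", tokens, coll_prob=0.01),
          dirichlet_prob("retrieval", tokens, coll_prob=0.01))
```

In this toy case the term frequency of "retrieval" is identical in both versions, so any difference in the smoothed probabilities comes from the extra markup tokens inflating the document length.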

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems and Management, Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3242180

Cited by 6 articles.

1. A Query-Based Weighted Document Partitioning Method for Load Balancing in Search Engines;Wireless Personal Communications;2023-03-20

2. ir_metadata: An Extensible Metadata Schema for IR Experiments;Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval;2022-07-06

3. The Problem of Semantic Shift in Longitudinal Monitoring of Social Media;14th ACM Web Science Conference 2022;2022-06-26

4. Evaluating Elements of Web-Based Data Enrichment for Pseudo-relevance Feedback Retrieval;Lecture Notes in Computer Science;2021

5. A Review of Tools and Techniques for Preprocessing of Textual Data;Computational Methods and Data Engineering;2020-08-20