Efficient Outlier Detection in Text Corpus Using Rare Frequency and Ranking

Published:2020-12-31 Issue:6 Volume:14 Page:1-30
ISSN:1556-4681
Container-title:ACM Transactions on Knowledge Discovery from Data
language:en
Short-container-title:ACM Trans. Knowl. Discov. Data

Author:

Mohotti Wathsala Anupama¹, Nayak Richi¹

Affiliation:

1. Queensland University of Technology, Australia

Abstract

Outlier detection in text data collections has become significant due to the need to find anomalies in the myriad of text data sources. High feature dimensionality, together with the large size of these document collections, creates a need for outlier detection methods that are both accurate and efficient. Traditional outlier detection methods face several challenges when dealing with text data, including data sparseness, distance concentration, and the presence of a large number of sub-groups. In this article, we propose to address these issues by developing novel concepts such as presenting documents with rare document frequency, finding a ranking-based neighborhood for similarity computation, and identifying sub-dense local neighborhoods in high dimensions. To improve the proposed primary method based on rare document frequency, we present several novel ensemble approaches that use the ranking concept to reduce false identifications while finding a higher number of true outliers. Extensive empirical analysis shows that the proposed method and its ensemble variations improve the quality of outlier detection in document repositories and scale well compared to the relevant benchmarking methods.

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3399712

Cited by 7 articles.

1. Simulated Workbench Design for Characterisation and Selection of Appropriate Outlier Ensemble Algorithm;International Journal of Information System Modeling and Design;2022-12-09

2. Dual-MGAN: An Efficient Approach for Semi-supervised Outlier Detection with Few Identified Anomalies;ACM Transactions on Knowledge Discovery from Data;2022-07-30

3. Machine Learning for Identifying Abusive Content in Text Data;Learning and Analytics in Intelligent Systems;2022

4. Out-of-Category Document Identification Using Target-Category Names as Weak Supervision;2021 IEEE International Conference on Data Mining (ICDM);2021-12

5. Deep neural network for text anomaly detection in SIoT;Computer Communications;2021-10