Affiliation:
1. Department of Industrial Engineering, Tel Aviv University, 69978 Tel Aviv, Israel
Abstract
Identifying anomalies in multidimensional data sets is an important yet challenging task in many real-world applications. A special case arises when anomalies are occluded in a small subset of attributes. We propose a new subspace analysis approach, called agglomerative attribute grouping (AAG), that searches for subspaces composed of highly correlative (in the general sense) attributes. Such correlations among attributes can better reflect the behavior of normal observations and hence can be used to improve the identification of abnormal data samples. The proposed AAG algorithm relies on a generalized multiattribute measure (derived from information theory measures over attributes’ partitions) for evaluating the “information distance” among various subsets of attributes. To determine the set of subspaces, AAG applies a variation of the well-known agglomerative clustering algorithm with the proposed measure as the underlying distance function. In contrast to existing methods, AAG does not require any tuning of parameters. Finally, the set of informative subspaces can be used to improve subspace-based analytical tasks, such as anomaly detection, novelty detection, forecasting, and clustering. Extensive evaluation over real-world data sets demonstrates that (i) in the vast majority of cases, AAG outperforms both classical and state-of-the-art subspace analysis methods when used in anomaly and novelty detection ensembles; (ii) it often generates fewer subspaces with fewer attributes each, thus resulting in faster training times for the anomaly and novelty detection ensemble; and (iii) the generated subspaces can also be useful in other analytical tasks, such as clustering and forecasting.
History: Kwok-Leung Tsui served as the senior editor for this article.
Funding: This research was partially supported by the Israeli Ministry of Economy (METRO 450 Consortium within the framework of the MAGNET program) as well as by the Koret Foundation grant for Smart Cities and Digital Living 2030.
Data Ethics & Reproducibility Note: The code capsule is available on Code Ocean at https://codeocean.com/capsule/2526218/tree/v1 and in the e-Companion to this article (available at https://doi.org/10.1287/ijds.2023.0027).
Publisher
Institute for Operations Research and the Management Sciences (INFORMS)