Robust subgroup discovery-Reference-Cited by-同舟云学术

Robust subgroup discovery

Published:2022-08-12 Issue:5 Volume:36 Page:1885-1970
ISSN:1384-5810
Container-title:Data Mining and Knowledge Discovery
language:en
Short-container-title:Data Min Knowl Disc

Author:

Proença Hugo M.^ORCID,Grünwald Peter^ORCID,Bäck Thomas^ORCID,van Leeuwen Matthijs^ORCID

Abstract

AbstractWe introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.

Funder

Nederlandse Organisatie voor Wetenschappelijk Onderzoek

Publisher

Springer Science and Business Media LLC

Subject

Computer Networks and Communications,Computer Science Applications,Information Systems

Link

https://link.springer.com/content/pdf/10.1007/s10618-022-00856-x.pdf

Reference96 articles.

1. Aggarwal CC, Bhuiyan MA, Hasan MA (2014) Frequent pattern mining algorithms: A survey. In: Aggarwal CC, Han J (eds) Frequent pattern mining. Springer International Publishing, Berlin, pp 19–64. https://doi.org/10.1007/978-3-319-07821-2_2

2. Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pp 207–216, https://doi.org/10.1145/170036.170072

3. Akaike H (1998) Information theory and an extension of the maximum likelihood principle. In: Parzen E, Tanabe K, Kitagawa G (eds) Selected papers of Hirotugu Akaike. Springer, New York, pp 199–213. https://doi.org/10.1007/978-1-4612-1694-0_15

4. Alcalá-Fdez J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Multiple-Valued Logic Soft Comput 17:255–287

5. Angelino E, Larus-Stone N, Alabi D, Seltzer M, Rudin C (2017) Learning certifiably optimal rule lists. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery (ACM), New York, NY, USA, KDD ’17, pp 35–44, https://doi.org/10.1145/3097983.3098047

Cited by 10 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. MRI-CE: Minimal rare itemset discovery using the cross-entropy method;Information Sciences;2024-04

2. Subgroup Discovery with SD4Py;Communications in Computer and Information Science;2024

3. Data is Moody: Discovering Data Modification Rules from Process Event Logs;Lecture Notes in Computer Science;2024

4. VLSD—An Efficient Subgroup Discovery Algorithm Based on Equivalence Classes and Optimistic Estimate;Algorithms;2023-05-29

5. Discovering Rule Lists with Preferred Variables;Advances in Intelligent Data Analysis XXI;2023