For real: a thorough look at numeric attributes in subgroup discovery

Published: 2020-09-21 Issue: 1 Volume: 35 Page: 158-212
ISSN: 1384-5810
Container-title: Data Mining and Knowledge Discovery
Language: en
Short-container-title: Data Min Knowl Disc

Author:

Meeng Marvin, Knobbe Arno

Abstract

Subgroup discovery (SD) is an exploratory pattern mining paradigm that comes into its own when dealing with large real-world data, which typically involves many attributes, of a mixture of data types. Essential is the ability to deal with numeric attributes, whether they concern the target (a regression setting) or the description attributes (by which subgroups are identified). Various specific algorithms have been proposed in the literature for both cases, but a systematic review of the available options is missing. This paper presents a generic framework that can be instantiated in various ways in order to create different strategies for dealing with numeric data. The bulk of the work in this paper describes an experimental comparison of a considerable range of numeric strategies in SD, where these strategies are organised according to four central dimensions. These experiments are furthermore repeated for both the classification task (target is nominal) and regression task (target is numeric), and the strategies are compared based on the quality of the top subgroup, and the quality and redundancy of the top-k result set. Results of three search strategies are compared: traditional beam search, complete search, and a variant of diverse subgroup set discovery called cover-based subgroup selection. Although there are various subtleties in the outcome of the experiments, the following general conclusions can be drawn: it is often best to determine numeric thresholds dynamically (locally), in a fine-grained manner, with binary splits, while considering multiple candidate thresholds per attribute.
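The closing recommendation of the abstract (local thresholds, fine-grained, binary splits, several candidates per attribute) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `top_m` parameter, and the choice of weighted relative accuracy (WRAcc) as the quality measure for a binary target are assumptions made here for illustration only.

```python
from typing import List, Sequence, Tuple


def wracc(n_sub: int, pos_sub: int, n_all: int, pos_all: int) -> float:
    """Weighted relative accuracy: coverage * (subgroup rate - overall rate).

    Assumed quality measure; the abstract does not name one.
    """
    if n_sub == 0 or n_all == 0:
        return 0.0
    return (n_sub / n_all) * (pos_sub / n_sub - pos_all / n_all)


def candidate_splits(
    values: Sequence[float],
    target: Sequence[bool],
    members: Sequence[int],
    top_m: int = 3,
) -> List[Tuple[float, str, float]]:
    """Score binary conditions 'x <= t' and 'x >= t' on one numeric attribute.

    members holds the row indices covered by the subgroup being refined, so
    thresholds are taken from those rows only (dynamic, local). Every distinct
    value among them is a candidate (fine-grained), and the top_m conditions
    are returned rather than a single best one.
    """
    n_all = len(values)
    pos_all = sum(target)
    thresholds = sorted({values[i] for i in members})
    scored = []
    for t in thresholds:
        for op in ("<=", ">="):
            if op == "<=":
                covered = [i for i in members if values[i] <= t]
            else:
                covered = [i for i in members if values[i] >= t]
            pos = sum(target[i] for i in covered)
            scored.append((wracc(len(covered), pos, n_all, pos_all), op, t))
    # Highest quality first; ties keep their original order.
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(t, op, q) for q, op, t in scored[:top_m]]


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    target = [False, False, False, True, True, True]
    # Refine the trivial subgroup that covers every row.
    for t, op, q in candidate_splits(values, target, range(len(values))):
        print(f"x {op} {t}: quality {q:.3f}")
```

In a beam search, each returned condition would extend the current subgroup description, and the function would be called again on the rows that the extended description covers. A global strategy would instead compute the threshold set once on the full data and reuse it at every search level.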

Funder

Leiden University

Publisher

Springer Science and Business Media LLC

Subject

Computer Networks and Communications, Computer Science Applications, Information Systems

Link

https://link.springer.com/content/pdf/10.1007/s10618-020-00703-x.pdf

References: 53 articles.

1. Atzmüller M (2015) Subgroup discovery. Wiley Interdiscip Rev Data Min Knowl Discov 5(1):35–49. https://doi.org/10.1002/widm.1144

2. Atzmüller M, Lemmerich F (2009) Fast subgroup discovery for continuous target concepts. In: Rauch J, Raś ZW, Berka P, Elomaa T (eds) ISMIS 2009, International symposium on methodologies for intelligent systems, Prague, Czech Republic, 14–17 September, 2009, Proceedings, LNCS, vol 5722. Springer, Berlin, pp 35–44. https://doi.org/10.1007/978-3-642-04125-9_7

3. Atzmüller M, Puppe F (2006) SD-map—a fast algorithm for exhaustive subgroup discovery. In: Fürnkranz J, Scheffer T, Spiliopoulou M (eds) PKDD 2006, European conference on principles and practice of knowledge discovery in databases, 18–22 Sept 2006, Proceedings, LNCS, vol 4213. Springer, Berlin, pp 6–17. https://doi.org/10.1007/11871637_6

4. Belfodil A [Aimene], Belfodil A, Kaytoue M (2018) Anytime subgroup discovery in numerical domains with guarantees. In: Berlingerio M, Bonchi F, Gärtner T, Hurley N, Ifrim G (eds) ECML PKDD 2018, European conference on machine learning and principles and practice of knowledge discovery in databases, Dublin, Ireland, 10–14 Sept 2018, proceedings, part II, LNCS, vol 11052. Springer, Cham, pp 500–516. https://doi.org/10.1007/978-3-030-10928-8_30

5. Boley M, Goldsmith BR, Ghiringhelli LM, Vreeken J (2017) Identifying consistent statements about numerical data with dispersion-corrected subgroup discovery. Data Min Knowl Discov 31(5):1391–1418. https://doi.org/10.1007/s10618-017-0520-3

Cited by 11 articles.

1. Introducing exceptional growth mining—Analyzing the impact of soil characteristics on on-farm crop growth and yield variability;PLOS ONE;2024-01-29

2. Subgroup Discovery with SD4Py;Communications in Computer and Information Science;2024

3. Efficiently Mining Closed Interval Patterns with Constraint Programming;Lecture Notes in Computer Science;2024

4. Fast Redescription Mining Using Locality-Sensitive Hashing;Lecture Notes in Computer Science;2024

5. Discovery of User Groups Densely Connecting Virtual and Physical Worlds in Event-Based Social Networks;International Journal of Information Technologies and Systems Approach;2023-07-28