How many data clusters are in the Galaxy data set?

Published: 2021-08-26
ISSN:1862-5347
Container-title:Advances in Data Analysis and Classification
language:en
Short-container-title:Adv Data Anal Classif

Author:

Grün Bettina, Malsiner-Walli Gertraud, Frühwirth-Schnatter Sylvia

Abstract

In model-based clustering, the Galaxy data set is often used as a benchmark data set to study the performance of different modeling approaches. Aitkin (Stat Model 1:287–304) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approach due to the fact that the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and conclusions drawn. The aim of the paper is to address Aitkin’s concerns about the Bayesian approach by shedding light on how the specified priors influence the number of estimated clusters. We perform a sensitivity analysis of different prior specifications for the mixtures of finite mixture model, i.e., the mixture model where a prior on the number of components is included. We use an extensive set of different prior specifications in a full factorial design and assess their impact on the estimated number of clusters for the Galaxy data set. Results highlight the interaction effects of the prior specifications and provide insights into which prior specifications are recommended to obtain a sparse clustering solution. A simulation study with artificial data provides further empirical evidence to support the recommendations. A clear understanding of the impact of the prior specifications removes restraints preventing the use of Bayesian methods due to the complexity of selecting suitable priors. Also, the regularizing properties of the priors may be intentionally exploited to obtain a suitable clustering solution meeting prior expectations and needs of the application.

Funder

Austrian Science Fund

Wirtschaftsuniversität Wien

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Statistics and Probability

Link

https://link.springer.com/content/pdf/10.1007/s11634-021-00461-8.pdf

Reference: 26 articles.

1. Aitkin M (2001) Likelihood and Bayesian analysis of mixtures. Stat Model 1(4):287–304. https://doi.org/10.1177/1471082x0100100404

2. Aitkin M, Anderson D, Hinde J (1981) Statistical modelling of data on teaching styles. J Royal Stat Soc A 144(4):419–461. https://doi.org/10.2307/2981826

3. Carlin BP, Chib S (1995) Bayesian model choice via Markov chain Monte Carlo methods. J Royal Stat Soc B 57:473–484. https://doi.org/10.1111/j.2517-6161.1995.tb02042.x

4. Crawford SL, DeGroot MH, Kadane JB, Small MJ (1992) Modeling lake-chemistry distributions: approximate Bayesian methods for estimating a finite-mixture model. Technometrics 34(4):441–453. https://doi.org/10.1080/00401706.1992.10484955

5. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc B 39(1):1–38. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x

Cited by 5 articles.

1. Mapping Strategies for Declarative Queries over Online Heterogeneous Biological Databases for Intelligent Responses;Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing;2023-03-27

2. Mapping Declarative Queries to Heterogeneous Biological Databases using Schema Graphs for Intelligent Responses;2022 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom);2022-12

3. Is infinity that far? A Bayesian nonparametric perspective of finite mixture models;The Annals of Statistics;2022-10-01

4. Bayesian Finite Mixture Models;Wiley StatsRef: Statistics Reference Online;2022-02-15

5. Generalized Mixtures of Finite Mixtures and Telescoping Sampling;Bayesian Analysis;2021-01-01