Abstract
Unsupervised learning, and clustering in particular, requires considerable expertise to apply effectively. Researchers must make careful, informed decisions about which algorithm to use, and with which hyperparameters, for a given dataset. They may also need to determine the number of clusters in the dataset, which is itself a required input to most clustering algorithms. All of this must be settled before they can begin their actual subject-matter work. After quantifying the impact of algorithm and hyperparameter selection, we propose an ensemble clustering framework that requires minimal input. It can be used to determine both the number of clusters in a dataset and a suitable choice of algorithm for that dataset. A code library is included in the Conclusions for ease of integration.
Subject
General Pharmacology, Toxicology and Pharmaceutics; General Immunology and Microbiology; General Biochemistry, Genetics and Molecular Biology; General Medicine