Author:
Keefe Murphy, Sonsoles López-Pernas, Mohammed Saqr
Abstract
Clustering is a collective term that refers to a broad range of techniques aimed at uncovering patterns and subgroups within data. Interest lies in partitioning heterogeneous data into homogeneous groups, whereby cases within a group are more similar to each other than to cases assigned to other groups, without foreknowledge of the group labels. Clustering is also an important component of several exploratory methods, analytical techniques, and modelling approaches, and has therefore been practiced for decades in education research. In this context, finding patterns or differences among students enables teachers and researchers to better understand the diversity of students and their learning processes, and to tailor their support to different needs. This chapter introduces the theory underpinning dissimilarity-based clustering methods. Then, we focus on some of the most widely used heuristic dissimilarity-based clustering algorithms, namely K-means, K-medoids, and agglomerative hierarchical clustering. The K-means algorithm is described in detail, including the arguments of the relevant R functions and the main limitations and practical concerns to be aware of in order to obtain the best performance. We also discuss the related K-medoids algorithm, along with its own associated concerns and function arguments. We then introduce agglomerative hierarchical clustering and the related R functions, outlining the various choices available to practitioners and their implications. Finally, methods for choosing the optimal number of clusters are presented, with emphasis on criteria that can guide the choice of clustering solution among multiple competing methodologies (in particular, evaluating solutions obtained using different dissimilarity measures) and not only the choice of the number of clusters K for a given method. All of these issues are demonstrated in detail with a tutorial in R using a real-life educational data set.
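The chapter's hands-on material is in R. Purely as an illustrative sketch (not the chapter's tutorial code, and using a placeholder data set and arbitrary parameter choices), the three algorithm families named above can be run with base R and the cluster package roughly as follows:

library(cluster)                              # assumed available from CRAN; provides pam()

X <- scale(iris[, 1:4])                       # placeholder data; standardise features before clustering

km  <- kmeans(X, centers = 3, nstart = 25)    # K-means with multiple random starts
pm  <- pam(X, k = 3)                          # K-medoids (PAM) on Euclidean dissimilarities
hc  <- hclust(dist(X), method = "ward.D2")    # agglomerative hierarchical clustering
hcl <- cutree(hc, k = 3)                      # cut the dendrogram into 3 clusters

table(km$cluster, hcl)                        # cross-tabulate the competing partitions

Here the number of clusters is fixed at 3 for illustration only; the chapter itself discusses criteria for choosing K and for comparing solutions obtained with different methods and dissimilarity measures.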
Publisher
Springer Nature Switzerland