Abstract
Analysis of high-dimensional datasets often relies on summary statistics, one of which is the correlation coefficient. These values then inform downstream analysis, whether in feature selection or in the subsequent construction of networks and heatmaps. Condensing pairwise scatterplots into single values, however, often results in a loss of information. First illustrated by F. J. Anscombe with his famous ‘Anscombe’s Quartet,’ this phenomenon has been canonically used to demonstrate the importance of plotting and the limitations of summary statistics such as correlation or variance [F. J. Anscombe (1973), American Statistician 27 (1), 17-21]. While numerous methods exist for generating visually distinct datasets that share similar summary statistics, the converse, telling such datasets apart, has not been extensively studied. To address this gap, we propose ICLUST (Image CLUSTering), an image classification tool that can visually distinguish pairwise relationships with similar summary statistics in simulations and identify meaningful clusters in real data. Such a tool can complement exploratory analysis or feature selection by identifying relationships between variables that traditional summary metrics cannot capture.
Significance Statement
Distilling large-scale, multidimensional datasets via analysis of pairwise relationships often employs a single value to describe the relationship between two variables. However, as demonstrated through simulations, such summarization fails to retain the nuances of the data. Characteristics such as the type of relationship (linear versus nonlinear, etc.) and the spread of the data are commonly lost when using correlations. Here we propose a transfer learning framework, borrowing from image clustering and classification software, to visually classify graphs. We apply our method to the separation of scatterplots with similar correlation statistics but visually distinct patterns in both simulations and real data, demonstrating its broad applicability.
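The loss of information described above can be checked directly on Anscombe's Quartet. The following is a minimal sketch and not part of ICLUST: the helper `pearson` and the dictionary `correlations` are names introduced here for illustration, and the data are the published quartet values from Anscombe (1973).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Anscombe's Quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I (linear with noise)": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                                     7.24, 4.26, 10.84, 4.82, 5.68]),
    "II (curved)": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                           6.13, 3.10, 9.13, 7.26, 4.74]),
    "III (linear, one outlier)": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                                         6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV (vertical line, one leverage point)": (x4, [6.58, 5.76, 7.71, 8.84,
                                                    8.47, 7.04, 5.25, 12.50,
                                                    5.56, 7.91, 6.89]),
}

# One correlation per dataset; the scatterplots differ, the statistic barely does.
correlations = {name: pearson(xs, ys) for name, (xs, ys) in quartet.items()}

for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

All four datasets yield a correlation of roughly 0.82 even though their scatterplots look entirely different, which is the kind of distinction a correlation coefficient alone cannot express.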
Publisher
Cold Spring Harbor Laboratory