Abstract
Dimensionality reduction methods are fundamental to the exploration and visualisation of large data sets. Basic requirements for unsupervised data exploration are flexibility and scalability. However, the computational limitations of current methods restrict the exploration of data structures to the lower range of scales. We focus on t-SNE and propose a chunk-and-mix protocol that enables the parallel implementation of this algorithm, as well as a self-adaptive parametric scheme that simplifies its configuration. As a proof of concept, we present the pt-SNE algorithm, a parallel version of Barnes-Hut-SNE (an $O(n \log n)$ implementation of t-SNE). In pt-SNE, a single free parameter for the size of the neighbourhood, namely the perplexity, modulates the visualisation of the data structure at different scales, from local to global. Thanks to parallelisation, the runtime of the algorithm remains almost independent of the perplexity, which extends the range of scales that can be analysed. pt-SNE converges to a good global embedding comparable to current solutions, although it adds a little noise at the local scale. This noise illustrates an unavoidable trade-off between computational speed and accuracy. We expect the same approach to be applicable to embedding algorithms faster than Barnes-Hut-SNE, such as Fast-Fourier Interpolation-based t-SNE or Uniform Manifold Approximation and Projection, thus extending the state of the art and allowing a more comprehensive visualisation and analysis of data structures.