Abstract
We introduce the Unbiasing Variational Autoencoder (UVAE), a novel computational framework developed for the integration of unpaired biomedical data streams, with a particular focus on clinical flow cytometry. UVAE addresses the challenges of batch effect correction and data alignment by training a semi-supervised model on partially labeled datasets. This approach enables the simultaneous normalisation and integration of diverse data within a shared latent space. The framework is implemented in Python with a descriptive interface for the specification and incorporation of multiple, partially overlapping data series. UVAE employs a probabilistic model for batch effect normalisation, with a generative capacity for unbiased data reconstruction and inference from heterogeneous samples. Its training process balances class contents at several stages, so that classes are accurately represented in statistical analyses. The model converges through a stable, non-adversarial training mechanism, complemented by automated selection of hyper-parameters via Bayesian optimisation. We quantitatively validate the performance of UVAE’s constituent components and then apply it to the real problem of integrating heterogeneous clinical flow cytometry data collected from COVID-19 patients. We show that the alignment process enhances the statistical signal of cell types associated with severity and enables clustering of subpopulations without the impediment of batch effects. Finally, we demonstrate that homogeneous data generated by UVAE can be used to improve the performance of longitudinal regression for predicting peak disease severity from temporal patient samples.
Availability
The framework is available at https://github.com/mikephn/UVAE. Benchmarking and clinical data with processing scripts will be made available upon completing peer review.
Publisher
Cold Spring Harbor Laboratory