Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints

Published: 2021-08-02, Issue: 8, Volume: 14, Page: 758
ISSN: 1424-8247
Journal: Pharmaceuticals
Language: English

Author:

Lovrić Mario, Đuričić Tomislav, Tran Han, Hussain Hussain, Lacić Emanuel, Rasmussen Morten, Kern Roman

Abstract

Methods for dimensionality reduction contribute significantly to knowledge generation in high-dimensional modeling scenarios across many disciplines. A lower-dimensional representation (also called an embedding) requires fewer computing resources in downstream machine learning tasks, leading to faster training, lower complexity, and statistical flexibility. In this work, we investigate the utility of three prominent unsupervised embedding techniques, principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and variational autoencoders (VAEs), for solving classification tasks in the domain of toxicology. To this end, we compare these embedding techniques against a set of molecular fingerprint-based models that do not apply additional preprocessing of features. Inspired by the success of transfer learning in several fields, we further study the performance of embedders when trained on an external dataset of chemical compounds. To gain a better understanding of their characteristics, we evaluate the embedders with different embedding dimensionalities and with different sizes of the external dataset. Our findings show that the recently popularized UMAP approach can be used alongside established techniques such as PCA and VAE as a pre-compression technique in the toxicology domain. Nevertheless, the generative VAE model shows an advantage in pre-compressing the data with respect to classification accuracy.
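
The transfer setup described in the abstract (fit an unsupervised embedder on an external compound set, then reuse it to compress the fingerprints of the target set) can be illustrated with a minimal sketch. This is not the authors' code: it covers only the PCA case, implements PCA directly with NumPy's SVD, and uses random binary vectors as placeholders for molecular fingerprints. The fingerprint length (256), the set sizes, the embedding dimension (16), and the helper names `fit_pca` and `transform` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for binary molecular fingerprints (NOT real chemistry data).
n_bits = 256
external = (rng.random((500, n_bits)) < 0.1).astype(float)  # unlabeled external compounds
target = (rng.random((120, n_bits)) < 0.1).astype(float)    # target (e.g. toxicology) compounds


def fit_pca(X, n_components):
    """Fit PCA via SVD; return the training mean and the top principal axes."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def transform(X, mean, components):
    """Project X onto previously fitted principal axes."""
    return (X - mean) @ components.T


# Transfer step: fit the embedder on the external set only, then reuse it
# unchanged to compress the target fingerprints.
mean, components = fit_pca(external, n_components=16)
embedded = transform(target, mean, components)

print(embedded.shape)
```

The compressed `embedded` matrix would then replace the raw fingerprints as input to a downstream classifier. Varying `n_components` and the number of rows in `external` corresponds to the two factors the abstract says were evaluated: embedding dimensionality and external dataset size.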

Funder

Horizon 2020

Österreichische Forschungsförderungsgesellschaft

Publisher

MDPI AG

Subject

Drug Discovery,Pharmaceutical Science,Molecular Medicine

Link

https://www.mdpi.com/1424-8247/14/8/758/pdf
