On the Best Way to Cluster NCI-60 Molecules

Published:2023-03-08 Issue:3 Volume:13 Page:498
ISSN:2218-273X
Container-title:Biomolecules
language:en
Short-container-title:Biomolecules

Author:

Hernández-Hernández Saiveth¹, Ballester Pedro J.²

Affiliation:

1. Cancer Research Center of Marseille (INSERM U1068, Institut Paoli-Calmettes, Aix-Marseille Université UM105, CNRS UMR7258), 13009 Marseille, France

2. Department of Bioengineering, Imperial College London, London SW7 2AZ, UK

Abstract

Machine learning-based models have been widely used in the early drug-design pipeline. To validate these models, cross-validation strategies have been employed, including those using clustering of molecules in terms of their chemical structures. However, the poor clustering of compounds will compromise such validation, especially on test molecules dissimilar to those in the training set. This study aims at finding the best way to cluster the molecules screened by the National Cancer Institute (NCI)-60 project by comparing hierarchical, Taylor–Butina, and uniform manifold approximation and projection (UMAP) clustering methods. The best-performing algorithm can then be used to generate clusters for model validation strategies. This study also aims at measuring the impact of removing outlier molecules prior to the clustering step. Clustering results are evaluated using three well-known clustering quality metrics. In addition, we compute an average similarity matrix to assess the quality of each cluster. The results show variation in clustering quality from method to method. The clusters obtained by the hierarchical and Taylor–Butina methods are more computationally expensive to use in cross-validation strategies, and both cluster the molecules poorly. In contrast, the UMAP method provides the best quality, and therefore we recommend it to analyze this highly valuable dataset.
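To make two ideas from the abstract concrete, the following is a minimal, dependency-free sketch of Taylor–Butina clustering on Tanimoto similarity, followed by one possible "average similarity matrix" over the resulting clusters. It is not the authors' code. The toy fingerprints (sets of on-bit indices), the 0.5 cutoff, the function names, and the convention that a singleton cluster has self-similarity 1.0 are all assumptions made for illustration, and the paper's exact definition of the matrix may differ. This variant recounts neighbours among still-unassigned molecules after each cluster is formed. In practice one would typically compute fingerprints and clusters with a cheminformatics library such as RDKit.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_cluster(fps, cutoff):
    """Taylor-Butina style clustering (illustrative variant).

    Returns a list of clusters; each cluster is a list of molecule
    indices with the centroid first.
    """
    n = len(fps)
    # Neighbour lists: pairs whose similarity reaches the cutoff.
    neighbours = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if tanimoto(fps[i], fps[j]) >= cutoff:
            neighbours[i].add(j)
            neighbours[j].add(i)

    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Centroid: the unassigned molecule with the most unassigned
        # neighbours (ties broken by lowest index).
        centroid = max(sorted(unassigned),
                       key=lambda i: len(neighbours[i] & unassigned))
        members = [centroid] + sorted(neighbours[centroid] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters


def average_similarity_matrix(fps, clusters):
    """Mean pairwise Tanimoto similarity within (diagonal) and
    between (off-diagonal) clusters. Assumed definition."""
    k = len(clusters)
    matrix = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            if a == b:
                pairs = list(combinations(clusters[a], 2))
            else:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            sims = [tanimoto(fps[i], fps[j]) for i, j in pairs]
            # A singleton cluster has no internal pairs; treat it as 1.0.
            matrix[a][b] = sum(sims) / len(sims) if sims else 1.0
    return matrix


# Hypothetical toy fingerprints, not NCI-60 data.
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
    {10, 11, 12}, {10, 11, 13},
    {20, 21},
]
clusters = butina_cluster(fps, cutoff=0.5)
matrix = average_similarity_matrix(fps, clusters)
print(clusters)
for row in matrix:
    print([round(value, 3) for value in row])
```

In a matrix of this kind, a well-separated clustering shows high values on the diagonal and low values off it, which is one way to read cluster quality at a glance.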

Funder

National Council of Sciences and Technology of Mexico

Wolfson Foundation

Royal Society (Royal Society Wolfson Fellowship)

Publisher

MDPI AG

Subject

Molecular Biology, Biochemistry

Link

https://www.mdpi.com/2218-273X/13/3/498/pdf

References: 38 articles.

1. Artificial intelligence for drug response prediction in disease models;Ballester;Brief. Bioinform.,2022

2. Machine learning for molecular modelling in drug design;Ballester;Biomolecules,2019

3. The NCI60 human tumour cell line anticancer drug screen;Shoemaker;Nat. Rev. Cancer,2006

4. The importance of prediction model validation and assessment in obesity and nutrition research;Ivanescu;Int. J. Obes.,2016

5. Most ligand-based classification benchmarks reward memorization rather than generalization;Wallach;J. Chem. Inf. Model.,2018

Cited by 1 article.

1. A practical guide to machine-learning scoring for structure-based virtual screening;Nature Protocols;2023-10-16