Abstract
We have developed Hestia, a computational tool that provides a unified framework for introducing similarity correction techniques across different biochemical data types. We propose a new strategy for dividing a dataset into training and evaluation subsets (CCPart) and have compared it against other methods at different thresholds to explore the impact that these choices have on the evaluation of model generalisation, through the lens of overfitting diagnosis. We have trained molecular language models for protein sequences, DNA sequences, and small molecule string representations (SMILES) on the training and evaluation subsets produced by the alternative splitting strategies. The effect of partitioning strategy and threshold depends on both the specific prediction task and the biochemical data type: tasks for which homology is important, such as enzymatic activity classification, are more sensitive to partitioning strategy than others, such as subcellular localization. Overall, the best threshold for small molecules seems to lie between 0.4 and 0.5 in Tanimoto distance, for DNA between 0.4 and 0.5, and for proteins between 0.3 and 0.5, depending on the specific task. Similarity correction algorithms showed a significantly better ability to diagnose overfitting in 11 out of 15 datasets, with CCPart being more clearly dependent on the threshold than the alternative GraphPart, which showed more instability.
Availability and implementation
The source code is freely available at https://github.com/IBM/Hestia. The tool is also made available through a dedicated web-server at http://peptide.ucd.ie/Hestia.
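The general idea of similarity-aware partitioning mentioned in the abstract can be illustrated with a minimal sketch. This is not Hestia's API and is not claimed to be the CCPart algorithm: it is a generic approach, assumed here for illustration, that links items whose Tanimoto similarity exceeds a threshold, groups them into connected components, and assigns whole components to one subset. The function names, the set-based fingerprints, and the use of similarity (rather than the distance quoted in the abstract) are all assumptions.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_clusters(items, threshold):
    """Connected components of the graph linking pairs above `threshold`."""
    parent = list(range(len(items)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if tanimoto(items[i], items[j]) > threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_by_cluster(items, threshold, test_fraction=0.2):
    """Hypothetical helper: fill the test set with the smallest clusters."""
    clusters = sorted(similarity_clusters(items, threshold), key=len)
    target = test_fraction * len(items)
    train, test = [], []
    for cluster in clusters:
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Toy fingerprints represented as sets of "on" bits (made-up data).
fingerprints = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8}, {7, 8, 9}, {20, 21}]
train_idx, test_idx = split_by_cluster(fingerprints, threshold=0.5)
```

Because whole components are kept together, no training item is more similar than the threshold to any evaluation item, which is the property that lets the evaluation subset expose overfitting to near-duplicates.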
Publisher
Cold Spring Harbor Laboratory