Abstract
Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build trustworthy models. We consider the design of such an AutoML tool for developing peptide bioactivity predictors. We analyse different design choices concerning data acquisition and negative class definition, homology partitioning for the construction of independent evaluation sets, the use of protein language models as a general sequence representation method, and model selection and hyperparameter optimisation. We find that the definition of the negative class has a significant impact on the perceived performance of the models, with differences of up to 40%; that homology partitioning leads to stricter evaluation, with drops of up to 50% in perceived performance; that protein language models achieve state-of-the-art performance across different tasks; and that hyperparameter optimisation enables simpler machine learning models to perform similarly to more complex architectures. Finally, we integrate the conclusions drawn from this study into AutoPeptideML, an end-to-end, user-friendly application that enables experimental researchers to build trustworthy models and facilitates compliance with community guidelines. The source code, documentation, and data are available at https://github.com/IBM/AutoPeptideML and a dedicated web-server is available at http://peptide.ucd.ie/AutoPeptideML.
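The idea behind homology partitioning can be illustrated with a minimal sketch. It is not the AutoPeptideML implementation: the function names, the 0.5 threshold, and the example peptides are hypothetical, and Python's `difflib.SequenceMatcher` ratio is used only as a crude stand-in for a proper alignment-based sequence identity. The point it shows is that similar sequences are grouped into clusters and each cluster is assigned as a whole to either the training or the evaluation set, so that no evaluation peptide has a close relative in training.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for pairwise sequence identity (assumption of this sketch)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_sequences(seqs, threshold=0.5):
    """Single-linkage clustering: any pair at or above the threshold is linked."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if similarity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def homology_split(seqs, test_fraction=0.2, threshold=0.5):
    """Return (train_indices, test_indices), keeping every cluster intact."""
    clusters = sorted(cluster_sequences(seqs, threshold), key=len)
    target = test_fraction * len(seqs)
    train, test = [], []
    for cluster in clusters:
        # Fill the test set with the smallest clusters until the target size.
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


if __name__ == "__main__":
    # Hypothetical peptides, used only to exercise the functions.
    peptides = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHIRL",
                "WWYYPPNNQQ", "WWYYPPNNQS", "MTTMTTMTTM"]
    train_idx, test_idx = homology_split(peptides)
    print("train:", [peptides[i] for i in train_idx])
    print("test:", [peptides[i] for i in test_idx])
```

A random split of the same data could place near-identical peptides on both sides of the split, which is one way the perceived performance of a model can be inflated relative to a homology-aware evaluation.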
Publisher
Cold Spring Harbor Laboratory