Ten quick tips for sequence-based prediction of protein properties using machine learning

Published:2022-12-01 Issue:12 Volume:18 Page:e1010669
ISSN:1553-7358
Container-title:PLOS Computational Biology
language:en
Short-container-title:PLoS Comput Biol

Author:

Hou Qingzhen, Waury Katharina, Gogishvili Dea, Feenstra K. Anton

Abstract

The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to “state-of-the-art,” take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.

Publisher

Public Library of Science (PLoS)

Subject

Computational Theory and Mathematics; Cellular and Molecular Neuroscience; Genetics; Molecular Biology; Ecology; Modeling and Simulation; Ecology, Evolution, Behavior and Systematics

Reference: 57 articles.

1. Machine learning in bioinformatics;P Larrañaga;Brief Bioinform,2006

2. Setting the standards for machine learning in biology;DT Jones;Nat Rev Mol Cell Biol,2019

3. A guide to machine learning for biologists;JG Greener;Nat Rev Mol Cell Biol,2021

4. Ten quick tips for deep learning in biology;BD Lee;PLoS Comput Biol,2022

5. Opportunities and obstacles for deep learning in biology and medicine;T Ching;J R Soc Interface,2018

Cited by 8 articles.

1. Improving viral annotation with artificial intelligence;mBio;2024-09-04

2. Seven quick tips for gene-focused computational pangenomic analysis;BioData Mining;2024-09-03

3. Understanding and Therapeutic Application of Immune Response in Major Histocompatibility Complex (MHC) Diversity Using Multimodal Artificial Intelligence;BioMedInformatics;2024-08-05

4. Pitfalls of machine learning models for protein–protein interaction networks;Bioinformatics;2024-01-10

5. Seq2Phase: language model-based accurate prediction of client proteins in liquid–liquid phase separation;Bioinformatics Advances;2023-12-22