RUDEUS, a machine learning classification system to study DNA-Binding proteins

Published: 2024-02-21

Author:

Medina-Ortiz David, Cabas-Mora Gabriel, Moya-Barría Iván, Soto-Garcia Nicole, Uribe-Paredes Roberto

Abstract

DNA-binding proteins are essential in different biological processes, including DNA replication, transcription, packaging, and chromatin remodelling. Exploring their characteristics and functions has become relevant in diverse scientific domains. Computational biology and bioinformatics have assisted in studying DNA-binding proteins, complementing traditional molecular biology methods. While recent advances in machine learning have enabled the integration of predictive systems with bioinformatic approaches, generalizable pipelines are still needed for identifying unknown proteins as DNA-binding and assessing the specific type of DNA strand they recognize. In this work, we introduce RUDEUS, a Python library featuring hierarchical classification models designed to identify DNA-binding proteins and assess the specific interaction type, whether single-stranded or double-stranded. RUDEUS has a versatile pipeline capable of training predictive models, combining protein language models with supervised learning algorithms, and integrating Bayesian optimization strategies. The trained models perform well, achieving a precision of 95% for DNA-binding identification and 89% for discerning between single-stranded and double-stranded interactions. RUDEUS includes an exploration tool for evaluating unknown protein sequences, annotating them as DNA-binding, and determining the type of DNA strand they recognize. Moreover, a structural bioinformatic pipeline has been integrated into RUDEUS for validating the identified DNA strand through DNA-protein molecular docking. These comprehensive strategies and their straightforward implementation demonstrate performance comparable to high-end models and enhance usability for integration into protein engineering pipelines.
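The hierarchical (two-stage) classification described in the abstract can be illustrated with a minimal sketch. This is not the RUDEUS API or the authors' implementation: the class names, label names, and toy 2-D vectors are hypothetical, the vectors stand in for protein language model embeddings, and a simple nearest-centroid rule is swapped in for the supervised learning algorithms the abstract mentions.

```python
from math import dist
from statistics import fmean


class NearestCentroid:
    """Tiny stand-in classifier: predicts the label of the closest class centroid."""

    def fit(self, vectors, labels):
        grouped = {}
        for vec, label in zip(vectors, labels):
            grouped.setdefault(label, []).append(vec)
        # Centroid = per-dimension mean of the vectors in each class.
        self.centroids = {
            label: [fmean(dim) for dim in zip(*vecs)]
            for label, vecs in grouped.items()
        }
        return self

    def predict(self, vec):
        return min(self.centroids, key=lambda label: dist(vec, self.centroids[label]))


class HierarchicalDNABindingClassifier:
    """Stage 1: binding vs. non-binding. Stage 2 (binders only): ssDNA vs. dsDNA."""

    def fit(self, vectors, labels):
        # Hypothetical label names: "non-binding", "ssDNA", "dsDNA".
        stage1_labels = [
            "non-binding" if label == "non-binding" else "binding" for label in labels
        ]
        self.stage1 = NearestCentroid().fit(vectors, stage1_labels)
        # The second model is trained only on the DNA-binding examples.
        binders = [(v, l) for v, l in zip(vectors, labels) if l != "non-binding"]
        self.stage2 = NearestCentroid().fit(
            [v for v, _ in binders], [l for _, l in binders]
        )
        return self

    def predict(self, vec):
        if self.stage1.predict(vec) == "non-binding":
            return "non-binding"
        return self.stage2.predict(vec)


# Toy 2-D vectors standing in for protein language model embeddings (assumption).
train_vectors = [
    [0.0, 0.0], [0.2, 0.1],  # non-binding
    [5.0, 0.0], [5.2, 0.3],  # ssDNA-binding
    [5.0, 5.0], [4.8, 5.1],  # dsDNA-binding
]
train_labels = ["non-binding", "non-binding", "ssDNA", "ssDNA", "dsDNA", "dsDNA"]

model = HierarchicalDNABindingClassifier().fit(train_vectors, train_labels)
predictions = [model.predict(v) for v in ([0.1, 0.2], [5.1, 0.2], [4.9, 4.8])]
print(predictions)  # ['non-binding', 'ssDNA', 'dsDNA']
```

The second stage only sees sequences that the first stage labels as DNA-binding, so each model solves a simpler binary problem than a single three-class model would.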

Publisher

Cold Spring Harbor Laboratory

References: 48 articles.

1. Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631.

2. Sentinels of chromatin: chromodomain helicase DNA-binding proteins in development and disease

3. DP-BINDER: machine learning model for prediction of DNA-binding proteins by fusing evolutionary and physicochemical information; Journal of Computer-Aided Molecular Design, 2019

4. SDBP-Pred: Prediction of single-stranded and double-stranded DNA-binding proteins by extending consensus sequence and K-segmentation strategies into PSSM; Analytical Biochemistry, 2020

Cited by 3 articles.

1. Integrative workflows for the characterization of hydrophobin and cerato-platanin in the marine fungus Paradendryphiella salina;Archives of Microbiology;2024-08-23

2. Protein Language Models and Machine Learning Facilitate the Identification of Antimicrobial Peptides;International Journal of Molecular Sciences;2024-08-14

3. Peptipedia v2.0: A peptide sequence database and user-friendly web platform. A major update;2024-07-16