DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts

Published: 2022-09-01 Issue: Supplement_2 Volume: 38 Page: ii95-ii98
ISSN: 1367-4803
Container-title: Bioinformatics
Language: en

Author:

Geffen Yaron¹, Ofran Yanay¹, Unger Ron¹

Affiliation:

1. The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel

Abstract

Summary: Recently, deep learning models, initially developed in the field of natural language processing (NLP), were applied successfully to analyze protein sequences. A major drawback of these models is their size, in terms of both the number of parameters to be fitted and the computational resources they require. Recently, ‘distilled’ models using the concept of student and teacher networks have been widely used in NLP. Here, we adapted this concept to the problem of protein sequence analysis by developing DistilProtBert, a distilled version of the successful ProtBert model. Implementing this approach, we reduced the size of the network and the running time by 50%, and the computational resources needed for pretraining by 98%, relative to the ProtBert model. Using two published tasks, we showed that the performance of the distilled model approaches that of the full model.

We next tested the ability of DistilProtBert to distinguish between real and random protein sequences. The task is highly challenging if the composition is maintained at the level of singlet, doublet and triplet amino acids. Indeed, traditional machine-learning algorithms have difficulties with this task. Here, we show that DistilProtBert performs very well on singlet-, doublet- and even triplet-shuffled versions of the human proteome, with AUC of 0.92, 0.91 and 0.87, respectively. Finally, we suggest that by examining the small number of false-positive classifications (i.e. shuffled sequences classified as proteins by DistilProtBert), we may be able to identify de novo potential natural-like proteins based on random shuffling of amino acid sequences.

Availability and implementation: https://github.com/yarongef/DistilProtBert
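To make the shuffling task in the abstract concrete, the following is a minimal sketch of the simplest case only: a singlet-level shuffle, which permutes residues at random and therefore preserves amino acid composition but not, in general, doublet or triplet composition. It is not the authors' implementation. The function names `kmer_counts` and `singlet_shuffle` and the toy sequence are hypothetical, chosen for illustration. Shuffles that also preserve doublet or triplet counts require a more involved algorithm that is not shown here.

```python
import random
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers (k=1 singlets, k=2 doublets, k=3 triplets)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def singlet_shuffle(seq: str, rng: random.Random) -> str:
    """Randomly permute residues; this preserves singlet composition only."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


# Hypothetical toy sequence, not taken from the paper's dataset.
real = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
rng = random.Random(0)  # fixed seed so the run is reproducible
shuffled = singlet_shuffle(real, rng)

# Singlet composition is identical by construction.
print(kmer_counts(real, 1) == kmer_counts(shuffled, 1))
# Doublet composition is usually, but not necessarily, different.
print(kmer_counts(real, 2) == kmer_counts(shuffled, 2))
```

Comparing `kmer_counts` at k = 1, 2 and 3 for a real sequence and its shuffled counterpart is one way to check which level of composition a given shuffling procedure actually maintains.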

Funder

DSI of Bar-Ilan University

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

https://academic.oup.com/bioinformatics/article-pdf/38/Supplement_2/ii95/49886166/btac474.pdf

References: 32 articles.

1. Unified rational protein engineering with sequence-based deep representation learning;Alley;Nat. Methods,2019

2. DeepLoc: prediction of protein subcellular localization using deep learning;Almagro Armenteros;Bioinformatics,2017

3. Accurate prediction of protein structures and interactions using a three-track neural network;Baek;Science,2021

4. ProteinBERT: a universal deep-learning model of protein sequence and function;Brandes;Bioinformatics,2022

5. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction;Cuff;Proteins,1999

Cited by 20 articles.

1. DeepNeuropePred: A robust and universal tool to predict cleavage sites from neuropeptide precursors by protein language model;Computational and Structural Biotechnology Journal;2024-12

2. A systematic evaluation of the language-of-viral-escape model using multiple machine learning frameworks;2024-09-08

3. PowerNovo: de novo peptide sequencing via tandem mass spectrometry using an ensemble of transformer and BERT models;Scientific Reports;2024-07-01

4. EuDockScore: euclidean graph neural networks for scoring protein-protein interfaces;2024-06-06

5. Application of Transformers in Cheminformatics;Journal of Chemical Information and Modeling;2024-05-30