Author:
Ataei Sima,Butler Gregory
Abstract
AbstractTransmembrane transport proteins play a vital role in cells’ metabolism by the selective passage of substrates through the cell membrane. Metabolic network reconstruction requires transport reactions that describe the specific substrate transported as well as the metabolic reactions of enzyme catalysis. In this paper, we apply BERT (Bidirectional Encoder Representations from Transformers) language model for protein sequences to predict one of 12 specific substrates. Our UniProt-ICAT-100 dataset is automatically constructed from UniProt using the ChEBI and GO ontologies to identify 4,112 proteins transporting 12 inorganic anion or cation substrates. We classified this dataset using three different models including Logistic Regression with an MCC of 0.81 and accuracy of 97.5%; Feed-forward Neural Networks classifier with an MCC of 0.88 and accuracy of 98.5%. Our third model utilizes a Fine-tuned BERT language model to predict the specific substrate with an MCC of 0.95 and accuracy of 99.3% on an independent test set.
Publisher
Cold Spring Harbor Laboratory