Affiliation:
1. Xuzhou University of Technology, School of Information Engineering, Xuzhou, 221018, China
2. Xuzhou First People’s Hospital, Department of Stomatology, Xuzhou, 221002, China
3. University of Jinan, School of Information Science, Jinan, 250024, China
Abstract
Introduction:
Transcription factors are vital biological components that control gene
expression, and their primary biological function is to recognize DNA sequences. As related
research has progressed, the specificity of DNA-protein binding has been found to play a significant
role in gene expression, regulation, and especially gene therapy. Convolutional Neural Networks (CNNs) have become increasingly popular for predicting DNA-protein-specific binding
sites, but their prediction accuracy still needs to be improved.
Methods:
We proposed WSHNN, a framework that combines Multi-Instance Learning (MIL) with a hybrid
neural network. First, we used sliding windows to split each DNA sequence
into multiple overlapping instances, so that each sequence formed a bag containing multiple instances. Then, the instances
were encoded using K-mer encoding. Afterward, the score of each instance in the same bag
was calculated separately by the hybrid neural network.
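The windowing and encoding steps described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function names, the window size, the stride, the value of k, and the choice of a one-hot vector per k-mer are all assumptions, since the abstract does not specify them.

```python
from itertools import product


def sliding_window_instances(seq, window=8, stride=2):
    """Split one DNA sequence (a bag) into overlapping instances.

    window and stride are illustrative values, not taken from the paper.
    """
    if len(seq) < window:
        return [seq]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


def kmer_index(k):
    """Map every k-mer over the alphabet ACGT to an integer index."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def kmer_one_hot(instance, k=2):
    """Encode an instance as one one-hot row of length 4**k per k-mer."""
    index = kmer_index(k)
    rows = []
    for i in range(len(instance) - k + 1):
        row = [0] * (4 ** k)
        kmer = instance[i:i + k]
        if kmer in index:  # k-mers with ambiguous bases (e.g. N) stay all-zero
            row[index[kmer]] = 1
        rows.append(row)
    return rows


# One sequence becomes one bag of overlapping instances.
bag = sliding_window_instances("ACGTACGTAC", window=8, stride=2)
# Each instance becomes a (window - k + 1) x 4**k matrix.
encoded = [kmer_one_hot(inst, k=2) for inst in bag]
```

Each encoded instance would then be scored by the network, and the instance scores of a bag combined into a single bag-level prediction.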
Results:
Finally, a fully connected network was used to produce the final prediction for that bag. The
framework achieved 90.73% precision, 82.77% recall, 87.17% accuracy,
an F1-score of 0.8657, and an MCC of 0.7462. In addition, we discussed the performance of K-mer encoding. Compared with other state-of-the-art efforts, the model performs
better using sequence information.
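For readers unfamiliar with the reported metrics, the sketch below computes them from confusion-matrix counts using their standard definitions. The function names and the toy counts are hypothetical; the paper's actual confusion matrix is not given in the abstract. The last lines check that the reported F1-score agrees with the harmonic mean of the reported precision and recall.

```python
import math


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f1_from(p, r):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy counts (hypothetical, for illustration only).
tp, fp, fn, tn = 8, 2, 2, 8
toy_mcc = mcc(tp, fp, fn, tn)

# Consistency check on the figures reported in the abstract.
reported_f1 = f1_from(0.9073, 0.8277)
print(round(reported_f1, 4))  # 0.8657
```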
Conclusion:
From the experimental results, it can be concluded that Bi-directional Long Short-Term Memory (Bi-LSTM) can better capture the long-range relationships within DNA
sequences (the code and data are available at https://github.com/baowz12345/Weak_Super_Network).
Publisher
Bentham Science Publishers Ltd.