Affiliation:
1. Inner Mongolia Agricultural University
2. Chinese Academy of Sciences
3. The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University
4. University of Electronic Science and Technology of China
Abstract
In protein sequence recognition, feature selection aims to identify an optimal feature set that serves as training input for classifiers and reveals the key sequence features of specific proteins. During selection, features relevant to the target task are retained, while irrelevant and redundant features are removed. Ideally, the result is a feature combination with fewer dimensions and higher performance metrics. This paper proposes IIFS2.0, an algorithm based on a cache elimination strategy that takes the locally optimal combinations of cached feature subsets as its starting point. It uses the cache elimination strategy to search for new feature combinations, which avoids the drawbacks of manual intervention and of excessive reliance on feature ranking results. We validated and analyzed its effectiveness on a protein dataset, showing that IIFS2.0 significantly reduces the dimensionality of feature combinations while also improving various evaluation metrics. IIFS2.0 is available to researchers at http://112.124.26.17:8006/.
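The abstract does not spell out the steps of IIFS2.0, so the following is only a minimal sketch of the general idea it describes: keep a bounded cache of the best feature subsets found so far, extend each cached subset, and evict the weakest candidates. It is a generic beam-style search and not the authors' implementation. The names `cached_subset_search`, `evict`, `rank_key`, and `toy_score` are hypothetical, and the toy scoring function stands in for what would normally be a cross-validated classifier metric.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Subset = FrozenSet[int]


def rank_key(item: Tuple[Subset, float]) -> Tuple[float, int]:
    """Prefer a higher score; on ties, prefer the smaller subset."""
    subset, score = item
    return (score, -len(subset))


def evict(cache: Dict[Subset, float], cache_size: int) -> Dict[Subset, float]:
    """Keep only the cache_size best subsets (deterministic tie-breaking)."""
    ordered = sorted(
        cache.items(),
        key=lambda kv: (-kv[1], len(kv[0]), sorted(kv[0])),
    )
    return dict(ordered[:cache_size])


def cached_subset_search(
    n_features: int,
    score: Callable[[Subset], float],
    cache_size: int = 3,
    max_rounds: int = 10,
) -> Tuple[Subset, float]:
    """Beam-style search over feature subsets with a bounded cache.

    Assumption: this illustrates cache-and-evict search in general,
    not the specific procedure used by IIFS2.0.
    """
    # Seed the cache with every single-feature subset, then evict.
    cache = {frozenset([i]): score(frozenset([i])) for i in range(n_features)}
    cache = evict(cache, cache_size)
    best = max(cache.items(), key=rank_key)

    for _ in range(max_rounds):
        # Extend every cached subset by one unused feature.
        candidates = dict(cache)
        for subset in cache:
            for f in range(n_features):
                if f not in subset:
                    new = subset | {f}
                    if new not in candidates:
                        candidates[new] = score(new)
        cache = evict(candidates, cache_size)
        round_best = max(cache.items(), key=rank_key)
        if rank_key(round_best) <= rank_key(best):
            break  # no improvement in this round: stop early
        best = round_best
    return best


def toy_score(subset: Subset) -> float:
    """Hypothetical metric: features 0 and 1 are redundant with each
    other, feature 2 is independently useful, the rest are noise."""
    group_a = 1 if (0 in subset or 1 in subset) else 0
    group_b = 1 if 2 in subset else 0
    return (5 + 2 * group_a + 2 * group_b) / 10


best_subset, best_score = cached_subset_search(5, toy_score, cache_size=3)
print(sorted(best_subset), best_score)
```

In this toy setting the search keeps one feature from the redundant pair plus the independent feature, which mirrors the stated goal of a smaller feature combination with a higher score. In practice `score` would wrap a classifier evaluation, and the cache size would trade search breadth against runtime.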
Publisher
Research Square Platform LLC