Abstract
In bioinformatics, machine learning methods have been used to predict features embedded in the sequences. In contrast to what is generally assumed, machine learning approaches can also provide new insights into the underlying biology. Here, we demonstrate this by presenting TargetP 2.0, a novel state-of-the-art method to identify N-terminal sorting signals, which direct proteins to the secretory pathway, mitochondria, and chloroplasts or other plastids. By examining the strongest signals from the attention layer in the network, we find that the second residue in the protein, i.e., the one following the initial methionine, has a strong influence on the classification. When subsequently examining all targeting peptides, we observe that two-thirds of chloroplast and thylakoid transit peptides have an alanine in position two, compared with only 20% of other plant proteins. Further highlighting the importance of the second residue, we also note that in fungi and single-celled eukaryotes, less than 30% of the targeting peptides have an amino acid in position two that allows the removal of the N-terminal methionine, compared with 60% of the proteins without a targeting peptide. TargetP 2.0 is available at http://www.cbs.dtu.dk/services/TargetP-2.0/index.php
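The position-two statistics quoted above can be illustrated with a minimal Python sketch. It is not the TargetP 2.0 code: the function name `second_residue_fraction`, the toy sequences, and the residue set `MET_REMOVAL_RESIDUES` are all assumptions introduced here. In particular, the abstract does not list which residues permit methionine removal; the set below follows the commonly cited rule that small second residues allow excision of the initiator methionine.

```python
# Residues at position two that are commonly reported to permit removal of the
# initiator methionine. ASSUMPTION: this set is not given in the abstract.
MET_REMOVAL_RESIDUES = frozenset("ACGPSTV")


def second_residue_fraction(sequences, residues):
    """Return the fraction of sequences whose second residue is in `residues`.

    Only sequences that start with methionine and have at least two residues
    are counted; returns 0.0 if no sequence qualifies.
    """
    usable = [s for s in sequences if len(s) >= 2 and s[0] == "M"]
    if not usable:
        return 0.0
    hits = sum(1 for s in usable if s[1] in residues)
    return hits / len(usable)


# Made-up toy sequences for illustration only (not real proteins).
toy_sequences = ["MASSL", "MAKTT", "MLRSV", "MKWVT"]

# Fraction with alanine in position two.
ala_fraction = second_residue_fraction(toy_sequences, {"A"})

# Fraction whose second residue would permit methionine removal (assumed set).
removal_fraction = second_residue_fraction(toy_sequences, MET_REMOVAL_RESIDUES)
```

Running the same function separately on sequences with and without targeting peptides would give the kind of comparison the abstract reports, provided the sequence groups are supplied from an annotated dataset.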
Publisher
Cold Spring Harbor Laboratory