Abstract
Protein secondary structure prediction is a subproblem of protein folding. A lightweight algorithm capable of accurately predicting secondary structure from only the protein residue sequence could provide a useful input for tertiary structure prediction, alleviating the reliance on multiple sequence alignments (MSA) typically seen in today’s best-performing models. Unfortunately, existing datasets for secondary structure prediction are small, creating a bottleneck. We present PS4, a dataset of 18,731 non-redundant protein chains and their respective secondary structure labels. Each chain is identified, and the dataset is also non-redundant against other secondary structure datasets commonly seen in the literature. We perform ablation studies by training secondary structure prediction algorithms on the PS4 training set, and obtain state-of-the-art accuracy on the CB513 test set in zero shots.
Publisher
Cold Spring Harbor Laboratory