Abstract
AbstractWe present NEAR, a method based on representation learning that is designed to rapidly identify good sequence alignment candidates from a large protein database. NEAR’s neural embedding model computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of k-NN search, filtration, and neighbor aggregation. NEAR’s ResNet embedding model is trained using an N-pairs loss function guided by sequence alignments generated by the widely usedHMMER3tool. Benchmarking results reveal improved performance relative to state-of-the-art neural embedding models specifically developed for protein sequences, as well as enhanced speed relative to the alignment-based filtering strategy used inHMMER3’ssensitive alignment pipeline.
Publisher
Cold Spring Harbor Laboratory