Effect of tokenization on transformers for biological sequences

Published: 2024-03-29 Issue: 4 Volume: 40
ISSN: 1367-4803
Container-title: Bioinformatics
Language: en

Author:

Dotan Edo¹², Jaschek Gal³, Pupko Tal², Belinkov Yonatan¹

Affiliation:

1. The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel

2. The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel

3. Department of Genetics, Yale University School of Medicine, New Haven, CT 06510, United States

Abstract

Motivation

Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families.

Results

We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data.

Availability and implementation

Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.

Funder

Azrieli Foundation Early Career Faculty Fellowship

Tel Aviv University

Israel Science Foundation

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btae196/57226869/btae196.pdf

References: 55 articles.

1. Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses;Alam;PLoS One,2020

2. A review of deep learning applications in human genomics using next-generation sequencing data;Alharbi;Hum Genomics,2022

3. Basic local alignment search tool;Altschul;J Mol Biol,1990

4. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures;Andreeva;Nucleic Acids Res,2020

5. ProteinBERT: a universal deep-learning model of protein sequence and function;Brandes;Bioinformatics,2022

Cited by 3 articles.

1. Range-limited Heaps’ law for functional DNA words in the human genome;Journal of Theoretical Biology;2024-09

2. Understanding the natural language of DNA using encoder–decoder foundation models with byte-level precision;Bioinformatics Advances;2024

3. A study of the impact of scientific collaboration on the application of Large Language Model;AIMS Mathematics;2024