Abstract
The emerging field of Genome-NLP (Natural Language Processing) aims to analyse biological sequence data using machine learning (ML), offering significant advancements in data-driven diagnostics. Three key challenges exist in Genome-NLP. First, long biomolecular sequences require “tokenisation” into smaller subunits, which is non-trivial since many biological “words” remain unknown. Second, ML methods are highly nuanced, reducing interoperability and usability. Third, comparing models and reproducing results are difficult due to the large volume and poor quality of biological data.

To tackle these challenges, we developed the first automated Genome-NLP workflow that integrates feature engineering and ML techniques. The workflow is designed to be species- and sequence-agnostic. In this workflow: (a) we introduce a new transformer-based model for genomes called genomicBERT, which empirically tokenises sequences while retaining biological context. This approach minimises manual preprocessing, reduces vocabulary sizes, and effectively handles out-of-vocabulary “words”. (b) We enable the comparison of ML model performance even in the absence of raw data.

To facilitate widespread adoption and collaboration, we have made genomicBERT available as part of the publicly accessible conda package called genomeNLP. We have demonstrated the application of genomeNLP on multiple case studies, showcasing its effectiveness in the field of Genome-NLP.

Highlights
We provide a comprehensive classification of genomic data tokenisation and representation approaches for ML applications, along with their pros and cons.
We infer k-mers directly from the data and handle out-of-vocabulary words. At the same time, we achieve a significantly reduced vocabulary size compared to the conventional k-mer approach, drastically reducing computational complexity.
Our method is agnostic to species or biomolecule type, as it is data-driven.
We enable comparison of trained model performance without requiring original input data, metadata or hyperparameter settings.
We present the first publicly available, high-level toolkit that infers the grammar of genomic data directly through artificial neural networks.
Preprocessing, hyperparameter sweeps, cross-validations, metrics and interactive visualisations are automated but can be adjusted by the user as needed.
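For readers unfamiliar with the conventional k-mer approach that the highlights compare against, the following is a minimal sketch of fixed-length k-mer tokenisation in plain Python. It is not the genomicBERT tokeniser and does not use the genomeNLP API; the function names (`kmer_tokenise`, `max_vocabulary_size`, `encode`), the `[UNK]` placeholder and the example sequences are all hypothetical, chosen only for illustration. It shows why a fixed vocabulary can grow as large as 4^k for DNA, and how a k-mer absent from that vocabulary ends up as an out-of-vocabulary token.

```python
from collections import Counter

UNK = "[UNK]"  # hypothetical placeholder for out-of-vocabulary k-mers


def kmer_tokenise(sequence: str, k: int, stride: int = 1) -> list[str]:
    """Split a sequence into fixed-length k-mers with a sliding window."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def max_vocabulary_size(alphabet_size: int, k: int) -> int:
    """Upper bound on the number of distinct k-mers over a given alphabet."""
    return alphabet_size ** k


def encode(tokens: list[str], vocabulary: set[str]) -> list[str]:
    """Replace any token missing from the vocabulary with the UNK placeholder."""
    return [tok if tok in vocabulary else UNK for tok in tokens]


# Build a vocabulary from one (made-up) training sequence.
train_tokens = kmer_tokenise("ATGCGATACG", k=3)
vocabulary = set(Counter(train_tokens))

# A new sequence containing an ambiguous base 'N' yields unseen k-mers.
new_tokens = kmer_tokenise("ATGNGA", k=3)
encoded = encode(new_tokens, vocabulary)

print(train_tokens)
print(encoded)
print(max_vocabulary_size(4, 6))  # bound for DNA 6-mers
```

The bound grows exponentially with k, which is the cost the abstract refers to when it contrasts a data-driven vocabulary with the conventional k-mer approach.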
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.