Enhancing Recognition and Interpretation of Functional Phenotypic Sequences through Fine-Tuning Pre-Trained Genomic Models-Reference-Cited by-同舟云学术

Enhancing Recognition and Interpretation of Functional Phenotypic Sequences through Fine-Tuning Pre-Trained Genomic Models

Published:2023-12-07 Issue: Volume: Page:
ISSN:
Container-title:
language:
Short-container-title:

Author:

Du Duo,Zhong Fan,Liu Lei

Abstract

AbstractDecoding high-quality human genomic sequences requires comprehensive analysis of DNA sequence functionality. Through computational and experimental approaches, researchers study the genotype-phenotype relationship and generate important datasets that help unravel complicated genetic blueprints. This study explores the use of deep learning, particularly pre-trained models like DNA_bert_6 and human_gpt2-v1, in interpreting and representing human genome sequences. We meticulously construct multiple datasets linking genotypes and phenotypes to fine-tune pre-trained models for precise DNA sequence classification. Furthermore, we specifically focused on the human endogenous retrovirus (HERV) dataset with commendable classification performance (both binary and multi-classification accuracy and F1 values above 0.935 and 0.888, respectively). We evaluate the influence of sequence length on classification results and analyze the impact of feature extraction in the model’s hidden layers using the HERV dataset. To further understand the phenotype-specific patterns learned by the model, we perform enrichment, pathogenicity and conservation analyzes of specific motifs in the HERV sequence with high average local representation weight (LRAW) scores. Overall, the generated datasets further provide numerous additional genotype-phenotype datasets for evaluating the performance of genomic models. The findings highlight the potential of large models in learning DNA sequence representations, particularly when utilizing the HERV dataset, and provide valuable insights for future research. This work represents an innovative strategy that combines pre-trained model representations with classical omics methods for analyzing the functionality of genome sequences, fostering cross-fertilization between genomics and advanced AI. The source code and data are available athttps://github.com/GeorgeBGM/Genome_Fine-Tuning.

Publisher

Cold Spring Harbor Laboratory

Reference55 articles.

1. The Sequence of the Human Genome

2. Implementing genomic medicine in the clinic: the future is here

3. The Protein-Coding Human Genome: Annotating High-Hanging Fruits;Bioessays,2019

4. SnapShot: Human endogenous retroviruses;Cell,2022