Predicting Phenotypic Traits Using a Massive RNA-seq Dataset

Published: 2023-12-07

Author:

Hadish John Anthony, Honaas Loren A., Ficklin Stephen Patrick

Abstract

Transcriptomic data can be used to predict environmentally impacted phenotypic traits. This type of prediction is particularly useful for monitoring difficult-to-measure phenotypic traits and has become increasingly popular for monitoring high-value agricultural crops and in precision medicine. Despite this increase in popularity, little research has been done on how many samples are required for these models to be accurate, and which normalization should be used. Here we create a massive RNA-seq dataset from publicly available Arabidopsis thaliana data with corresponding measurements for age and tissue type. We use this dataset to determine how many samples are required for accurate model prediction and which normalization method is required. We find that Median Ratios Normalization significantly increases performance when predicting age. We also find that in the case of our dataset, only a few hundred samples are required to predict tissue types, and only a few thousand samples are necessary to accurately predict age. Researchers should consider these results when choosing the number of samples in a transcriptomic experiment and during data processing.

Author Summary

Large datasets have become ubiquitous in both research and industry, with thousands and sometimes millions of samples being collected for a single project. In biology a prominent new technology is RNA-seq, which can be used to measure the expression level of thousands of genes for a single sample. These measurements are used for a variety of downstream applications, including predicting phenotypic traits (e.g. height or disease). A number of experiments have attempted to use RNA-seq data to make phenotype predictions, with varying success. This is partially due to the small sample size of their experiments. RNA-seq datasets are currently relatively small, typically a dozen to a few hundred samples, because of the cost per sample. This is expected to change as the cost of sequencing decreases.

In this paper we create a massive conglomerate RNA-seq dataset from publicly available Arabidopsis thaliana RNA-seq data. We use this dataset to determine how many samples are required to accurately predict plant age and tissue type using machine learning models. We also explore the best way to normalize large datasets. Our results show the potential of massive RNA-seq datasets, and can be used to inform experimental design for phenotype prediction.
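The abstract names Median Ratios Normalization but does not describe how it is computed. As a rough illustration only, the sketch below follows the median-of-ratios scheme as it is commonly described (a per-sample size factor taken as the median ratio of each gene's count to that gene's geometric mean across samples). It is assumed, not confirmed by this record, that the paper uses this variant; the function names `size_factors` and `normalize` and the toy counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Per-sample size factors by the median-of-ratios scheme.

    counts: list of rows, one row per gene, one raw count per sample.
    Illustrative sketch; not taken from the paper's code.
    """
    n_samples = len(counts[0])
    # Use only genes with a nonzero count in every sample, so that the
    # geometric mean (and its logarithm) is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy data (made up): sample 2 is sample 1 sequenced at twice the depth,
# plus one gene containing a zero, which is skipped when estimating factors.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],
]
toy_factors = size_factors(toy_counts)
toy_normalized = normalize(toy_counts)
```

In the toy data the second sample's size factor comes out at twice the first, so the depth difference cancels after division and the first three genes have equal normalized values in both samples.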

Publisher

Cold Spring Harbor Laboratory
