Abstract
Transcriptomic data can be used to predict environmentally impacted phenotypic traits. This type of prediction is particularly useful for monitoring difficult-to-measure phenotypic traits and has become increasingly popular for monitoring high-value agricultural crops and in precision medicine. Despite this increase in popularity, little research has been done on how many samples these models require to be accurate, or on which normalization method should be used. Here we create a massive RNA-seq dataset from publicly available Arabidopsis thaliana data with corresponding measurements for age and tissue type. We use this dataset to determine how many samples are required for accurate model prediction and which normalization method is required. We find that Median Ratios Normalization significantly increases performance when predicting age. We also find that, in the case of our dataset, only a few hundred samples are required to predict tissue type, and only a few thousand samples are necessary to accurately predict age. Researchers should consider these results when choosing the number of samples in a transcriptomic experiment and during data processing.
Author Summary
Large datasets have become ubiquitous in both research and industry, with thousands and sometimes millions of samples being collected for a single project. In biology, a prominent new technology is RNA-seq, which can be used to measure the expression level of thousands of genes in a single sample. These measurements are used for a variety of downstream applications, including predicting phenotypic traits (e.g. height or disease). A number of experiments have attempted to use RNA-seq data to make phenotype predictions, with varying success. The varying success is partly due to the small sample sizes of these experiments. RNA-seq datasets are currently relatively small, ranging from only a dozen to a few hundred samples, because of the cost per sample. This is expected to change as the cost of sequencing decreases.
In this paper we create a massive conglomerate RNA-seq dataset from publicly available Arabidopsis thaliana RNA-seq data. We use this dataset to determine how many samples are required to accurately predict plant age and tissue type using machine learning models. We also explore the best way to normalize large datasets. Our results show the potential of massive RNA-seq datasets, and can be used to inform experimental design for phenotype prediction.
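The text above names Median Ratios Normalization but does not describe how it was implemented. As an illustration only, the sketch below follows the median-of-ratios scheme as it is commonly defined: take each gene's geometric mean across samples, then use the median per-sample ratio to that mean as a size factor. The function names, the input layout (a list of genes, each a list of raw counts per sample), and the choice to skip genes with any zero count are assumptions made for this example, not details taken from the paper.

```python
import math
from statistics import median


def median_ratio_size_factors(counts):
    """Estimate one size factor per sample (hypothetical helper).

    counts: list of genes, each a list of raw counts, one per sample.
    """
    n_samples = len(counts[0])

    # Keep only genes with a positive count in every sample, so the
    # geometric mean (computed in log space) is well defined.
    kept = [row for row in counts if all(c > 0 for c in row)]
    if not kept:
        raise ValueError("no gene has positive counts in every sample")

    # Log of each kept gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in kept]

    # Size factor of a sample = median ratio of its counts to the
    # per-gene geometric means (median taken in log space).
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg
                      for row, lg in zip(kept, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every count by its sample's size factor."""
    factors = median_ratio_size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy example: sample 2 was sequenced twice as deeply as sample 1.
raw = [[10, 20], [100, 200], [5, 10], [0, 7]]
factors = median_ratio_size_factors(raw)
normalized = normalize(raw)
```

In this toy input the second sample's size factor comes out at twice the first, so after normalization the genes that differ only by sequencing depth have matching values across the two samples.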
Publisher
Cold Spring Harbor Laboratory