Author:
Mecham Avery, Stephenson Ashlie, Quinteros Badi I., Salmons Grace, Piccolo Stephen R.
Abstract
TidyGEO is a Web-based tool for downloading, tidying, and reformatting data series from Gene Expression Omnibus (GEO). As a freely accessible repository with data from over 4 million biological samples across more than 4,000 organisms, GEO provides diverse opportunities for secondary research. Transcriptomic data are most common in GEO, but other measurement types are also prevalent, including DNA methylation levels, genotypes, and chromatin-accessibility profiles. GEO’s diversity and expansiveness present opportunities and challenges. Although scientists may find assay data relevant to a given research question, most analyses require sample annotations, such as a sample’s treatment group, disease subtype, or age. In GEO, such annotations are stored alongside assay data in delimited, text-based files. However, the structure and semantics of the annotations vary widely from one series to another, and many annotations are not useful for analysis purposes. Thus, every GEO series must be tidied before it can be analyzed. Manual approaches may be used, but these are error prone and take time away from other research tasks. Custom computer scripts can be written, but many scientists lack the computational expertise to create such scripts. To address these challenges, we created TidyGEO, which supports essential data-cleaning tasks for sample-level annotations, such as selecting informative columns, renaming columns, splitting or merging columns, standardizing data values, and filtering samples. Additionally, users can integrate annotations with assay data, restructure assay data, and generate code that enables others to reproduce these steps. The source code for TidyGEO is at https://github.com/srp33/TidyGEO.
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.