Author:
Frisby Trevor S., Baker Shawn James, Marçais Guillaume, Hoang Quang Minh, Kingsford Carl, Langmead Christopher James
Abstract
We present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building. We demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, the largest publicly available collection of whole genome sequences. Next, using breast cancer data from The Cancer Genome Atlas (TCGA), we show that Harvestman selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. Finally, we compare Harvestman to existing feature selection methods and demonstrate that our method selects smaller and less redundant feature subsets, while maintaining accuracy of the resulting classifier. The data used is available through either the 1000 Genomes Project or The Cancer Genome Atlas. Access to TCGA data requires the completion of a Data Access Request through the Database of Genotypes and Phenotypes (dbGaP). Binary releases of Harvestman compatible with Linux, Windows, and Mac are available for download at https://github.com/cmlh-gp/Harvestman-public/releases
Publisher
Cold Spring Harbor Laboratory