Scaling tree-based automated machine learning to biomedical big data with a feature set selector

Published:2019-06-04 Issue:1 Volume:36 Page:250-256
ISSN:1367-4803
Container-title:Bioinformatics
language:en

Author:

Le Trang T¹, Fu Weixuan¹, Moore Jason H¹

Affiliation:

1. Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA

Abstract

Motivation: Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, the Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming (GP) to recommend an optimized analysis pipeline for the data scientist’s prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data.

Results: We introduce two new features implemented in TPOT that help increase the system’s scalability: Feature Set Selector (FSS) and Template. FSS provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT’s efficiency on big data by slicing the entire dataset into smaller sets of features and allowing GP to select the best subset in the final pipeline. Template enforces type constraints with strongly typed GP and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show that TPOT-FSS significantly outperforms a tuned XGBoost model and the standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independently of the previous study, which identified a significant association between two modules and depression severity, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual.

Availability and implementation: Detailed simulation and analysis code needed to reproduce the results in this study is available at https://github.com/lelaboratoire/tpot-fss. Implementation of the new TPOT operators is available at https://github.com/EpistasisLab/tpot.

Supplementary information: Supplementary data are available at Bioinformatics online.
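The feature-set idea described in the Results can be illustrated with a minimal, self-contained sketch. This is not TPOT's implementation: TPOT lets GP choose the subset and scores it through the rest of the pipeline, whereas the sketch below simply tries every subset and uses a toy score (the mean gap between class averages). The function names `select_feature_set` and `mean_gap_score`, the gene names and the module names are all hypothetical.

```python
from statistics import mean

def mean_gap_score(sliced_rows, labels):
    """Toy score (an assumption, not TPOT's metric): the average, over
    features, of the absolute gap between the two class means."""
    gaps = []
    for j in range(len(sliced_rows[0])):
        pos = [row[j] for row, y in zip(sliced_rows, labels) if y == 1]
        neg = [row[j] for row, y in zip(sliced_rows, labels) if y == 0]
        gaps.append(abs(mean(pos) - mean(neg)))
    return mean(gaps)

def select_feature_set(rows, labels, subsets, score):
    """Slice the data by each named feature subset and score each slice.
    Returns the best subset name and all scores. Exhaustive search is used
    here for clarity; TPOT instead leaves this choice to GP."""
    scores = {}
    for name, columns in subsets.items():
        sliced = [[row[c] for c in columns] for row in rows]
        scores[name] = score(sliced, labels)
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical expression values for four samples and four genes.
rows = [
    {"g1": 5.0, "g2": 4.8, "g3": 1.0, "g4": 0.9},
    {"g1": 5.2, "g2": 5.1, "g3": 1.1, "g4": 1.0},
    {"g1": 1.0, "g2": 1.2, "g3": 1.0, "g4": 1.1},
    {"g1": 0.9, "g2": 1.0, "g3": 0.9, "g4": 1.0},
]
labels = [1, 1, 0, 0]

# Each named subset plays the role of one predefined feature set.
subsets = {"module_A": ["g1", "g2"], "module_B": ["g3", "g4"]}

best, scores = select_feature_set(rows, labels, subsets, mean_gap_score)
print(best, scores)
```

Because only the columns of one subset are passed downstream, each candidate pipeline works on a much smaller matrix than the full dataset, which is the source of the efficiency gain the abstract describes.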

Funder

National Institutes of Health

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics,Computational Theory and Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Statistics and Probability

Link

http://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btz470/28862658/btz470.pdf

Reference: 40 articles.

1. Random search for hyper-parameter optimization;Bergstra;J. Mach. Learn. Res,2012

2. Polymorphisms in FKBP5 are associated with increased recurrence of depressive episodes and rapid response to antidepressant treatment;Binder;Nat. Genet,2004

Cited by 275 articles.

1. Good results from sensor data: Performance of machine learning algorithms for regression problems in chemical sensors;Sensors and Actuators B: Chemical;2024-12

2. CascadeDumpNet: Enhancing open dumpsite detection through deep learning and AutoML integrated dual-stage approach using high-resolution satellite imagery;Remote Sensing of Environment;2024-11

3. GHOST: Graph-based higher-order similarity transformation for classification;Pattern Recognition;2024-11

4. Practical feature filter strategy to machine learning for small datasets in chemistry;Scientific Reports;2024-09-03

5. Machine learning and related approaches in transcriptomics;Biochemical and Biophysical Research Communications;2024-09