What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics

Published:2021-12-04
ISSN:0340-6717
Container-title:Human Genetics
language:en
Short-container-title:Hum Genet

Author:

Musolf Anthony M., Holzinger Emily R., Malley James D., Bailey-Wilson Joan E.

Abstract

Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods also evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.

Funder

National Human Genome Research Institute

Publisher

Springer Science and Business Media LLC

Subject

Genetics (clinical), Genetics

Link

https://link.springer.com/content/pdf/10.1007/s00439-021-02402-z.pdf

Reference: 86 articles.

1. Abo Alchamlat S, Farnir F (2017) KNN-MDR: a learning approach for improving interactions mapping performances in genome wide association studies. BMC Bioinform 18:184. https://doi.org/10.1186/s12859-017-1599-7

2. Abu Alfeilat HA, Hassanat ABA, Lasassmeh O, Tarawneh AS, Alhasanat MB, Eyal Salman HS, Prasath VBS (2019) Effects of distance measure choice on K-nearest neighbor classifier performance: a review. Big Data 7:221–248. https://doi.org/10.1089/big.2018.0175

3. Altmann A, Toloşi L, Sander O, Lengauer T (2010) Permutation importance: a corrected feature importance measure. Bioinformatics 26:1340–1347. https://doi.org/10.1093/bioinformatics/btq134