APPROACHES TO DIMENSIONALITY REDUCTION FOR ULTRA-HIGH DIMENSIONAL MODELS

Authors:

Kotlarz Krzysztof, Słomian Dawid, Szyda Joanna

Abstract

The rapid advancement of high-throughput sequencing technologies has revolutionised genomic research by providing access to large amounts of genomic data. However, the most important disadvantage of using Whole Genome Sequencing (WGS) data is its statistical nature, the so-called p>>n problem, in which the number of features (p) far exceeds the number of observations (n). This study aimed to compare three feature selection approaches that circumvent the p>>n problem, one of which is a novel modification of Supervised Rank Aggregation (SRA). The use of the three methods was demonstrated in the classification of 1,825 individuals from the 1000 Bull Genomes Project into 5 breeds, based on 11,915,233 SNP genotypes from WGS. In the first step, we applied three feature (i.e. SNP) selection methods: the mechanistic approach (SNP tagging) and two approaches that consider biological and statistical contexts by fitting a multiclass logistic regression model followed by either 1-dimensional clustering (1D-SRA) or multi-dimensional feature clustering (MD-SRA), the latter originally proposed in this study. Next, we performed the classification using a Deep Learning architecture composed of Convolutional Neural Networks. The classification quality on the test data set was expressed by the macro F1-Score. The SNPs selected by SNP tagging yielded the least satisfactory results (86.87%). Still, this approach offered rapid computing times by focussing only on pairwise LD between SNPs and disregarding the effects of SNPs on classification. 1D-SRA was less suitable for ultra-high-dimensional applications due to computational, memory and storage limitations; however, the SNP set selected by this approach provided the best classification quality (96.81%). MD-SRA provided a very good balance between classification quality (95.12%) and computational efficiency (17x lower analysis time and 14x lower data storage), outperforming other methods. Moreover, unlike SNP tagging, both SRA-based approaches are universal and not limited to feature selection for genomic data. Our work addresses the urgent need for computational techniques that are both effective and efficient in the analysis and interpretation of large-scale genomic datasets. We offer a model suitable for the classification of ultra-high-dimensional data that fuses feature selection and deep learning techniques.
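
The macro F1-Score used above to express classification quality can be illustrated with a minimal sketch. This is not the authors' code: the function name `macro_f1` and the breed labels in the usage note are hypothetical, and the sketch assumes the standard definition, in which the F1 score is computed separately for each class and the per-class scores are averaged with equal weight.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with made-up labels `y_true = ["A", "A", "B", "B", "C"]` and `y_pred = ["A", "B", "B", "B", "C"]`, the per-class F1 scores are 2/3, 4/5 and 1, so `macro_f1(y_true, y_pred)` is about 0.822. Because every class carries the same weight, a breed with few individuals influences the score as much as a large one.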

Publisher

Cold Spring Harbor Laboratory
