An explainable deep learning classifier of bovine mastitis based on whole genome sequence data - circumventing the p>>n problem

Authors

Kotlarz K., Mielczarek M., Biecek P., Wojdak-Maksymiec K., Suchocki T., Topolski P., Jagusiak W., Szyda J.

Abstract

The most serious drawback underlying the biological annotation of Whole Genome Sequence (WGS) data is the p>>n problem, meaning that the number of polymorphic variants (p) is much larger than the number of available phenotypic records (n). Therefore, the major aim of the study was to propose a way to circumvent the problem by combining a LASSO logistic regression model with Deep Learning (DL). The approach was illustrated by a practical biological problem: the classification of cows as mastitis-susceptible or mastitis-resistant based on genotypes of Single Nucleotide Polymorphisms (SNPs) identified in their WGS. Several DL architectures were proposed by optimising DL hyperparameters with the Optuna software and were applied to different SNP subsets defined by LASSO logistic regressions with different penalty values. Among them, the architecture with 204,642 SNPs was selected as the best one. This architecture was composed of 2 layers with 7 and 46 units, respectively, and respective drop-out rates of 0.210 and 0.358. Its classification of the test data set resulted in AUC=0.750, accuracy=0.650, sensitivity=0.600, and specificity=0.700, and the model was therefore carried forward to genomic and functional annotation. Significant SNPs were selected based on the SHapley Additive exPlanation (SHAP) values transformed to Z-scores to assess the underlying type I error. These SNPs were annotated to genes. As a final result, a single GO term related to biological process and thirteen GO terms related to molecular function were significantly enriched in the gene set that corresponded to the significant SNPs.

Author Summary

Our objective is to distinguish between cows that are susceptible and cows that are resistant to bovine mastitis by analysing their genomic data. However, we face a significant challenge due to the large number of single nucleotide polymorphisms (SNPs) and the limited sample size.
To address this challenge, we use two methods: feature selection algorithms and deep learning. We experiment with various ways of implementing these techniques and evaluate their performance on a validation set. Our findings reveal that the optimal approach correctly predicts a cow's susceptibility or resistance status around 65% of the time. Additionally, we employ a technique to identify the most important SNPs and their biological functions. Our results indicate that some of these SNPs are related to immune response or protein synthesis pathways, implying that they may affect the cow's health and productivity.
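
The abstract states that significant SNPs were chosen from SHAP values transformed to Z-scores, but it does not give the exact procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each SNP has already been summarised by a single importance value (for example its mean absolute SHAP value), standardises those values across SNPs, and flags SNPs whose Z-score exceeds a one-sided standard normal threshold. The SNP names, the importance values, the function names, and the `alpha` level are all hypothetical.

```python
from statistics import NormalDist, mean, stdev


def shap_z_scores(importance):
    """Standardise per-SNP importance values to Z-scores.

    `importance` maps a SNP identifier to one summary value
    (assumed here to be a mean absolute SHAP value).
    """
    values = list(importance.values())
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation across SNPs
    return {snp: (value - mu) / sd for snp, value in importance.items()}


def significant_snps(importance, alpha=0.05):
    """Return SNPs whose Z-score exceeds the one-sided normal threshold.

    Assumption: importance is tested in the upper tail only, and no
    multiple-testing correction is applied in this sketch.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    z = shap_z_scores(importance)
    return sorted(snp for snp, score in z.items() if score > z_crit)


# Hypothetical importance values: 19 background SNPs and one outlier.
importance = {f"snp_{i}": 0.010 + 0.001 * (i % 5) for i in range(1, 20)}
importance["snp_20"] = 0.500

z = shap_z_scores(importance)
hits = significant_snps(importance, alpha=0.05)
print(hits)
```

With hundreds of thousands of SNPs, a fixed per-SNP threshold of this kind would produce many false positives by chance alone, so a real analysis would need to decide how the type I error is controlled across all tests.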

Publisher

Cold Spring Harbor Laboratory

References: 73 articles.

Cited by 2 articles.
