Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data
-
Published:2023-07-01
Issue:
Volume:
Page:
-
ISSN:1867-1764
-
Container-title:Statistics in Biosciences
-
language:en
-
Short-container-title:Stat Biosci
Author:
Sugolov Anton, Emmenegger Eric, Paterson Andrew D., Sun Lei
Abstract
Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop for senior high-school or junior undergraduate students, without prior knowledge in statistical genetics but with some basic knowledge in data science, to conduct their own genome-wide association study (GWAS). The GWAS was performed for open-source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain approximately 1.4 million p-values from a real scientific study within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further their learning motivation by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present in their friends or families. The appended open-source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on GitHub and Zenodo to further enhance reproducible research and training.
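As a rough illustration of the two statistical ideas the abstract names (per-variant regression and large-scale multiple hypothesis testing), the following minimal sketch regresses a simulated expression level on a simulated genotype dosage and compares the resulting p-value with a Bonferroni threshold for 1.4 million tests. It is not taken from the workshop's scripts: the function name `snp_association`, the sample size, the effect size, the 0.05 family-wise level, and the normal approximation to the t statistic are all assumptions made for illustration.

```python
import math
import random


def snp_association(genotypes, expression):
    """Simple linear regression of expression on genotype dosage (0/1/2).

    Returns (slope, two-sided p-value). The p-value uses a normal
    approximation to the t statistic (an assumption that is reasonable
    only when the sample size is large).
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("genotype has no variation; regression is undefined")
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    # Residual sum of squares and the standard error of the slope
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(genotypes, expression))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = slope / se
    # Two-sided tail probability of a standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return slope, p_value


# Bonferroni threshold for roughly 1.4 million tests at an assumed 0.05 level
n_tests = 1_400_000
bonferroni_threshold = 0.05 / n_tests

# Simulated data (hypothetical values, for illustration only)
random.seed(1)
n_samples = 400
genotypes = [random.choice([0, 1, 2]) for _ in range(n_samples)]
expr_causal = [0.8 * g + random.gauss(0, 1) for g in genotypes]  # true effect 0.8
expr_null = [random.gauss(0, 1) for _ in genotypes]              # no true effect

slope_causal, p_causal = snp_association(genotypes, expr_causal)
slope_null, p_null = snp_association(genotypes, expr_null)

print(f"Bonferroni threshold: {bonferroni_threshold:.3g}")
print(f"causal variant: slope={slope_causal:.3f}, p={p_causal:.3g}")
print(f"null variant:   slope={slope_null:.3f}, p={p_null:.3g}")
```

Dividing the significance level by the number of tests is only one simple way to control false positives across many tests; the sketch does not imply that the workshop used this exact correction.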
Funder
Natural Sciences and Engineering Research Council; Canadian Institutes of Health Research; Data Sciences Institute
Publisher
Springer Science and Business Media LLC
Subject
Biochemistry, Genetics and Molecular Biology (miscellaneous), Statistics and Probability
Cited by
1 article.