Systematic analysis of dark and camouflaged genes: disease-relevant genes hiding in plain sight

Published: 2019-01-09

Author:

Ebbert Mark T. W., Jensen Tanner D., Jansen-West Karen, Sens Jonathon P., Reddy Joseph S., Ridge Perry G., Kauwe John S. K., Belzil Veronique, Pregent Luc, Carrasquillo Minerva M., Keene Dirk, Larson Eric, Crane Paul, Asmann Yan W., Ertekin-Taner Nilufer, Younkin Steven G., Ross Owen A., Rademakers Rosa, Petrucelli Leonard, Fryer John D.

Abstract

Background: The human genome contains ‘dark’ gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions that are ‘dark by depth’ (few mappable reads) and others that are ‘camouflaged’ (ambiguous alignment), and we assess how well long-read technologies resolve these regions. We further present an algorithm to resolve most camouflaged regions (including in short-read data) and apply it to the Alzheimer’s Disease Sequencing Project (ADSP; 13142 samples) as a proof of principle.

Results: Based on standard whole-genome Illumina sequencing data, we identified 37873 dark regions in 5857 gene bodies (3635 protein-coding) from pathways important to human health, development, and reproduction. Of the 5857 gene bodies, 494 (8.4%) were 100% dark (142 protein-coding) and 2046 (34.9%) were ≥5% dark (628 protein-coding). Exactly 2757 dark regions were in protein-coding exons (CDS) across 744 genes. Long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduced dark CDS regions to approximately 45.1%, 33.3%, and 18.2%, respectively. Applying our algorithm to the ADSP, we rescued 4622 exonic variants from 501 camouflaged genes, including a rare, ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in only five ADSP cases and zero controls.

Conclusions: While we could not formally assess the CR1 frameshift mutation in Alzheimer’s disease (insufficient sample size), we believe it merits investigation in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
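The two region categories and the reported gene-body percentages can be illustrated with a minimal sketch. It is not the authors' algorithm: the `RegionStats` structure, the `classify_region` function, and both thresholds (`max_dark_depth`, `min_ambiguous_fraction`) are hypothetical and chosen only for illustration, since the abstract gives no cutoffs. The `percent` helper simply recomputes the 8.4% and 34.9% figures from the counts quoted above.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Per-region alignment summary (hypothetical structure for illustration)."""
    depth: int                # number of reads overlapping the region
    low_mapq_fraction: float  # fraction of those reads with ambiguous alignment


def classify_region(stats, max_dark_depth=5, min_ambiguous_fraction=0.9):
    """Label a region 'dark_by_depth', 'camouflaged', or 'resolved'.

    Both thresholds are illustrative assumptions, not values from the abstract.
    """
    # Few mappable reads: the region is dark by depth.
    if stats.depth <= max_dark_depth:
        return "dark_by_depth"
    # Enough reads, but most align ambiguously: the region is camouflaged.
    if stats.low_mapq_fraction >= min_ambiguous_fraction:
        return "camouflaged"
    return "resolved"


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print(classify_region(RegionStats(depth=2, low_mapq_fraction=0.0)))
    # Counts taken from the abstract: 494 and 2046 of 5857 gene bodies.
    print(percent(494, 5857))   # 8.4
    print(percent(2046, 5857))  # 34.9
```

Keeping the thresholds as parameters makes it explicit that they are tunable assumptions rather than published values.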

Publisher

Cold Spring Harbor Laboratory

References: 113 articles.

1. Long-read sequencing across the C9orf72 ‘GGGGCC’ repeat expansion: implications for clinical use and genetic discovery efforts in human disease

2. Estimating the total number of phosphoproteins and phosphorylation sites in eukaryotic proteomes

3. Human brain shaped by duplicate genes

4. Inhibition of SRGAP2 Function by Its Human-Specific Paralogs Induces Neoteny during Spine Maturation

5. Evolution of Human-Specific Neural SRGAP2 Genes by Incomplete Segmental Duplication

Cited by 1 article.

1. Long-read nanopore sequencing resolves a TMEM231 gene conversion event causing Meckel–Gruber syndrome; Human Mutation; 2019-11-11