Abstract
Gene expression data provides molecular insights into the functional impact of genetic variation, for example through expression quantitative trait loci (eQTL). With an improving understanding of the association between genotypes and gene expression comes a greater concern that gene expression profiles could be matched to genotype profiles of the same individuals in another dataset, known as a linking attack. Prior works demonstrating such a risk could analyze only a fraction of eQTLs that are independent due to restrictive model assumptions, leaving the full extent of this risk incompletely understood. To address this challenge, we introduce the discriminative sequence model (DSM), a novel probabilistic framework for predicting a sequence of genotypes based on gene expression data. By modeling the joint distribution over all known eQTLs in a genomic region, DSM improves the power of linking attacks with necessary calibration for linkage disequilibrium and redundant predictive signals. We demonstrate greater linking accuracy of DSM compared to existing approaches across a range of attack scenarios and datasets including up to 22K individuals, suggesting that DSM helps uncover a substantial additional risk overlooked by previous studies. Our work provides a unified framework for assessing the privacy risks of sharing diverse omics datasets beyond transcriptomics.
Publisher
Cold Spring Harbor Laboratory