A novel, computationally tractable algorithm flags in big matrices every column associated in any way with others or a dependent variable, with much higher power when columns are linked like mutations in chromosomes

Published: 2021-09-17

Author:

Antezana Marcos A.

Abstract

When a data matrix (DM) has many independent variables (IVs), it is not computationally tractable to assess the association of every distinct IV subset with the dependent variable (DV) of the DM, because the number of subsets explodes combinatorially as the number of IVs increases. Moreover, model selection and correction for multiple tests are complex even with few IVs.

DMs in genomics will soon summarize millions of mutation markers and genomes. Searching such DMs exhaustively for markers that, alone or synergistically with others, are associated with a trait is therefore computationally tractable only for 1- and 2-marker effects. Population geneticists likewise study mainly 2-marker combinations.

I present a computationally tractable, fully parallelizable Participation in Association Score (PAS) that, in a DM with markers, detects one by one every column that is strongly associated in any way with others. PAS does not examine column subsets, and its computational cost grows linearly with the number of columns, remaining reasonable even when DMs have millions of columns.

PAS exploits how associations of markers in the rows of a DM cause associations of matches in the rows’ pairwise comparisons. For every such comparison with a match at a tested column, PAS computes the matches at other columns by modifying the comparison’s total matches (scored once per DM), yielding a distribution of conditional matches that reacts diagnostically to the associations of the tested column. Equally computationally tractable is dvPAS, which flags DV-associated IVs by also probing the matches at the DV.

P values for the scores are readily obtained by permutation and accurately Šidák-corrected for multiple tests, bypassing model selection. The P values of a column’s PASs for different orders of association are i.i.d. and readily turned into a single P value.

Simulations show that PAS and dvPAS i) generate uniform-(0,1)-distributed type I error in null DMs and ii) detect randomly encountered binary and trinary models of significant n-column association and n-IV association with a binary DV, respectively, with power in the order of magnitude of exhaustive evaluation’s and false positives that are uniform-(0,1)-distributed or straightforwardly tuned to be so. Power to detect 2-way associations that extend over 100+ columns is non-parametrically ultimate, but power to detect pure n-column associations and pure n-IV DV associations sinks exponentially as n increases.

Important for geneticists, dvPAS power increases about twofold in trinary vs. binary DMs and by orders of magnitude with markers linked like mutations in chromosomes, especially in trinary DMs, where dvPAS furthermore fine-maps with highest resolution.
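
The pairwise-comparison idea in the abstract can be illustrated with a minimal Python sketch. It is not the author's PAS. It assumes that "modifying the comparison's total matches" means subtracting the one match at the tested column, and it uses the mean of the conditional matches as a placeholder summary statistic, since the abstract does not define the actual score. The function names (`total_matches`, `conditional_matches`, `permutation_p`, `sidak`) and the toy matrix are hypothetical. Only the Šidák formula, 1 - (1 - p)^m for m independent tests, is standard.

```python
import itertools
import random


def total_matches(dm):
    """Total matches for every row pair (i, j) with i < j, scored once per DM."""
    return {
        (i, j): sum(a == b for a, b in zip(dm[i], dm[j]))
        for i, j in itertools.combinations(range(len(dm)), 2)
    }


def conditional_matches(dm, totals, col):
    """Matches at the other columns, for each row pair that matches at `col`.

    Assumption: the total is 'modified' by subtracting the match at `col`.
    """
    return [t - 1 for (i, j), t in totals.items() if dm[i][col] == dm[j][col]]


def permutation_p(dm, col, n_perm=200, seed=0):
    """One-sided permutation P value for a placeholder statistic (the mean).

    Assumption: the null is simulated by shuffling the tested column across
    rows; the abstract does not specify the permutation scheme.
    """
    rng = random.Random(seed)
    totals = total_matches(dm)
    column = [row[col] for row in dm]
    # Matches at the other columns do not change when `col` is shuffled.
    others = {
        (i, j): t - (column[i] == column[j]) for (i, j), t in totals.items()
    }

    def stat(values):
        cond = [others[(i, j)] for (i, j) in others if values[i] == values[j]]
        return sum(cond) / len(cond) if cond else 0.0

    observed = stat(column)
    hits = 0
    for _ in range(n_perm):
        shuffled = column[:]
        rng.shuffle(shuffled)
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def sidak(p, m):
    """Sidak-corrected P value for m independent tests."""
    return 1.0 - (1.0 - p) ** m


# Toy binary DM (hypothetical): columns 0 and 1 are identical, column 2 is not.
dm = [
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]
totals = total_matches(dm)
cond_col0 = conditional_matches(dm, totals, 0)  # pairs matching at column 0
cond_col2 = conditional_matches(dm, totals, 2)  # pairs matching at column 2
p_col0 = sidak(permutation_p(dm, 0), m=len(dm[0]))
```

In the toy matrix, the row pairs that match at column 0 also match at column 1, so their conditional matches are higher than those of the pairs that match at the unassociated column 2. Each column is scored independently of the others, which is consistent with the linear cost and parallelizability described above, although scoring all row pairs as done here is quadratic in the number of rows.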

Publisher

Cold Spring Harbor Laboratory
