Abstract
ABSTRACTWhen a data matrix DM has many independent variables IVs, it is not computationally tractable to assess the association of every distinct IV subset with the dependent variable DV of the DM, because the number of subsets explodes combinatorially as IVs increase. But model selection and correcting for multiple tests is complex even with few IVs.DMs in genomics will soon summarize millions of mutation markers and genomes. Searching exhaustively in such DMs for markers that alone or synergistically with others are associated with a trait is therefore computationally tractable only for 1- and 2-marker effects. Also population geneticists study mainly 2-marker combinations.I present a computationally tractable, fully parallelizable Participation in Association Score (PAS) that in a DM with markers detects one by one every column that is strongly associated in any way with others. PAS does not examine column subsets and its computational cost grows linearly with the number of columns, remaining reasonable even when DMs have millions of columns.PAS exploits how associations of markers in the rows of a DM cause associations of matches in the rows’ pairwise comparisons. For every such comparison with a match at a tested column, PAS computes the matches at other columns by modifying the comparison’s total matches (scored once per DM), yielding a distribution of conditional matches that reacts diagnostically to the associations of the tested column. Equally computationally tractable is dvPAS that flags DV-associated IVs by also probing the matches at the DV.P values for the scores are readily obtained by permutation and accurately Sidak-corrected for multiple tests, bypassing model selection. The P values of a column’s PASs for different orders of association are i.i.d. and readily turned into a single P value.Simulations show that i) PAS and dvPAS generate uniform-(0,1)-distributed type I error in null DMs and ii) detect randomly encountered binary and trinary models of significant n-column association and n-IV association with a binary DV, respectively, with power in the order of magnitude of exhaustive evaluation’s and false positives that are uniform-(0,1)-distributed or straightforwardly tuned to be so. Power to detect 2-way associations that extend over 100+ columns is non-parametrically ultimate but that to detect pure n-column associations and pure n-IV DV associations sinks exponentially as n increases.Important for geneticists, dvPAS power increases about twofold in trinary vs. binary DMs and by orders of magnitude with markers linked like mutations in chromosomes, specially in trinary DMs where furthermore dvPAS fine-maps with highest resolution.
Publisher
Cold Spring Harbor Laboratory