Affiliation:
1. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
2. Northeastern University, Shenyang 110819, China
Abstract
The typical hypothesis testing issue in statistical analysis is determining whether a pattern is significantly associated with a specific class label. This usually leads to highly challenging multiple-hypothesis testing problems in big data mining scenarios, as millions or billions of hypothesis tests in large-scale exploratory data analysis can result in a large number of false positive results. The permutation testing-based FWER control method (PFWER) is theoretically effective in dealing with multiple hypothesis testing issues. In reality, however, this theoretical approach confronts a serious computational efficiency problem. It takes an extremely long time to compute an appropriate FWER false positive control threshold using PFWER, which is almost impossible to achieve in a reasonable amount of time using human effort on medium- or large-scale data. Although some methods for improving the efficiency of the FWER false positive control threshold calculation have been proposed, most of them are stand-alone, and there is still a lot of space for efficiency improvement. To address this problem, this paper proposes a distributed PFWER false-positive threshold calculation method for large-scale data. The computational effectiveness increases significantly when compared to the current approaches. The FP-growth algorithm is used first for pattern mining, and the mining process reduces the computation of invalid patterns by using pruning operations and index optimization for merging patterns with index transactions. The distributed computing technique is introduced on this basis, and the constructed FP tree is decomposed into a set of subtrees, each corresponding to a subtask. All subtrees (subtasks) are distributed to different computing nodes. Each node independently calculates the local significance threshold according to the designated subtasks. Finally, all local results are aggregated to compute the FWER false positive control threshold, which is completely consistent with the theoretical result. A series of experimental findings on 11 real-world datasets demonstrate that the distributed algorithm proposed in this paper can significantly improve the computation efficiency of PFWER while ensuring its theoretical accuracy.
Funder
National Natural Science Foundation of China
Subject
Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science
Reference43 articles.
1. Bayesian Hypothesis Testing Illustrated: An Introduction for Software Engineering Researchers;Erdogmus;ACM Comput. Surv.,2023
2. Level Sets Semimetrics for Probability Measures with Applications in Hypothesis Testing;Munoz;Methodol. Comput. Appl. Probab.,2023
3. Customers’ self-image congruity and brand preference: A moderated mediation model of self-brand connection and self-motivation;Li;J. Prod. Brand Manag.,2022
4. Qualifying and raising anti-money laundering alarms with deep learning;Jensen;Expert Syst. Appl.,2023
5. Cao, L., Zhang, C., Joachims, T., Webb, G.I., Margineantu, D.D., and Williams, G. (2015, January 10–13). Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia.
Cited by
2 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献