Affiliation:
1. Università di Padova, Padova, Italy
2. Brown University, Providence, RI
3. Amherst College, Amherst, MA
Abstract
“I’m an MC still as honest” – Eminem, Rap God
We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both (1) statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility offered by MCRapper is a big advantage over previously proposed solutions, which could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm, TFP-R, for the task of True Frequent Pattern (TFP) mining, by appropriately computing approximations of the negative and positive borders of the collection of patterns of interest, which allow an effective pruning of the pattern space and the computation of strong bounds to the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks.
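To make the quantity in the abstract concrete, the following is a minimal sketch of the n-trials MCERA under its standard definition: draw n vectors of independent Rademacher (±1) variables, take for each vector the supremum over the family of the signed sample mean, and average the n suprema. The sketch enumerates a small finite family by brute force; it does not implement MCRapper's bound-based pruning of the search space. The function name `mcera`, the toy dataset, and the choice of itemset-indicator functions are assumptions made for illustration only.

```python
import itertools
import random


def mcera(sample, functions, n_trials, rng):
    """Brute-force n-trials MCERA of a finite family `functions` on `sample`.

    For each trial, draws one Rademacher vector sigma in {-1, +1}^m and
    computes max over f of (1/m) * sum_i sigma_i * f(sample_i); returns
    the average of these maxima over all trials.
    """
    m = len(sample)
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(
            sum(s * f(x) for s, x in zip(sigma, sample)) / m
            for f in functions
        )
    return total / n_trials


# Hypothetical toy transactional dataset (each transaction is a set of items).
sample = [
    frozenset("abc"), frozenset("ab"), frozenset("ac"),
    frozenset("b"), frozenset("abc"), frozenset("c"),
]

# The family: one indicator function per non-empty itemset (pattern),
# equal to 1.0 when the pattern is contained in the transaction.
items = sorted(set().union(*sample))
patterns = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in itertools.combinations(items, k)
]
functions = [lambda t, p=p: 1.0 if p <= t else 0.0 for p in patterns]

estimate = mcera(sample, functions, n_trials=100, rng=random.Random(0))
print(f"100-trials MCERA estimate: {estimate:.4f}")
```

The brute-force maximum costs time proportional to the size of the family for every trial, which grows exponentially with the number of items; this is the cost that exploring the poset structure with upper bounds on the discrepancy is meant to avoid.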
Funder
National Science Foundation (NSF)
DARPA/AFRL
Italian Ministry of Education, University and Research
SID 2020: RATED-X
Publisher
Association for Computing Machinery (ACM)