Program Analysis for Adaptive Data Analysis

Published: 2024-06-20 Issue: PLDI Volume: 8 Page: 914-938
ISSN: 2475-1421
Container-title: Proceedings of the ACM on Programming Languages
Language: en
Short-container-title: Proc. ACM Program. Lang.

Author:

Liu Jiawen¹, Qu Weihao², Gaboardi Marco¹, Garg Deepak³, Ullman Jonathan⁴

Affiliation:

1. Boston University, Boston, USA

2. Monmouth University, West Long Branch, USA

3. MPI-SWS, Germany

4. Northeastern University, Boston, USA

Abstract

Data analyses are usually designed to identify some property of the population from which the data are drawn, generalizing beyond the specific data sample. For this reason, data analyses are often designed in a way that guarantees a low generalization error. That is, they are designed so that the result of a data analysis run on a data sample does not differ too much from the result one would obtain by running the analysis over the entire population. An adaptive data analysis can be seen as a process composed of multiple queries interrogating some data, where the choice of which query to run next may rely on the results of previous queries. The generalization error of each individual query/analysis can be controlled by using an array of well-established statistical techniques. However, when queries are arbitrarily composed, the different errors can propagate through the chain of queries and lead to a high generalization error. To address this issue, data analysts are designing several techniques that guarantee bounds not only on the generalization errors of single queries but also on the generalization error of the composed analyses. The choice of which of these techniques to use often depends on the chain of queries that an adaptive data analysis can generate. In this work, we consider adaptive data analyses implemented as while-like programs, and we design a program analysis that can help identify which technique to use to control their generalization errors. More specifically, we formalize the intuitive notion of adaptivity as a quantitative property of programs. We do this because the adaptivity level of a data analysis is a key measure for choosing the right technique. Based on this definition, we design a program analysis for soundly approximating this quantity. The program analysis generates a representation of the data analysis as a weighted dependency graph, where the weight is an upper bound on the number of times each variable can be reached, and uses a path search strategy to guarantee an upper bound on the adaptivity. We implement our program analysis and show that it can help to analyze the adaptivity of several concrete data analyses with different adaptivity structures.
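
To give a rough intuition for the idea of bounding adaptivity by searching paths in a dependency graph, the following Python sketch counts the largest number of queries that lie on a single dependency chain. It is a simplified illustration, not the analysis from the paper: it assumes the dependency graph is acyclic (as if loops had already been unrolled) and therefore ignores the reachability weights mentioned above. The function `adaptivity_bound`, the variable names, and the two-round example are all hypothetical.

```python
from functools import lru_cache


def adaptivity_bound(deps, queries):
    """Largest number of query variables on any dependency chain.

    deps    -- maps each variable to the variables it directly depends on;
               assumed acyclic (a simplification made for this sketch)
    queries -- set of variables whose values come from querying the data
    """

    @lru_cache(maxsize=None)
    def longest(v):
        # Best chain among the variables v depends on, plus v itself if it is a query.
        best = max((longest(u) for u in deps.get(v, ())), default=0)
        return best + (1 if v in queries else 0)

    return max((longest(v) for v in deps), default=0)


# Hypothetical two-round analysis: three independent first-round queries,
# a non-query aggregation step, and a final query chosen from the aggregate.
deps = {
    "a1": [],
    "a2": [],
    "a3": [],
    "s": ["a1", "a2", "a3"],  # plain computation, not a query
    "final": ["s"],           # query that depends on earlier results
}
queries = {"a1", "a2", "a3", "final"}

two_round = adaptivity_bound(deps, queries)
print(two_round)
```

In this toy example the three first-round queries do not depend on one another, so only one of them counts toward any single chain; the final query adds a second round.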

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3656414
