Cross-prediction-powered inference

Published:2024-04-03 Issue:15 Volume:121 Page:
ISSN:0027-8424
Container-title:Proceedings of the National Academy of Sciences
language:en
Short-container-title:Proc. Natl. Acad. Sci. U.S.A.

Author:

Zrnic Tijana¹², Candès Emmanuel J.¹³

Affiliation:

1. Department of Statistics, Stanford University, Stanford, CA 94305

2. Stanford Data Science, Stanford University, Stanford, CA 94305

3. Department of Mathematics, Stanford University, Stanford, CA 94305

Abstract

While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference [A. N. Angelopoulos, S. Bates, C. Fannjiang, M. I. Jordan, T. Zrnic, Science 382, 669–674 (2023)], which assumes that a good pretrained model is already available. We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals (CIs) typically have significantly lower variability.
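The impute-then-debias idea in the abstract can be illustrated with a rough sketch for the simplest target, a population mean. This is not the authors' implementation: the function names `fit_linear` and `cross_prediction_mean`, the least-squares model, and the choice of five folds are all assumptions made for illustration. Each fold of the labeled data is held out in turn, a model is trained on the remaining folds, the unlabeled labels are imputed by averaging the fold models, and the average held-out prediction error is subtracted as a bias correction. The sketch returns only a point estimate and does not attempt the confidence intervals discussed in the abstract.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept (an assumed stand-in for any ML model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def cross_prediction_mean(X_lab, y_lab, X_unlab, n_folds=5, seed=0):
    """Sketch of a cross-fitted, debiased estimate of the mean of y."""
    rng = np.random.default_rng(seed)
    n = len(y_lab)
    folds = np.array_split(rng.permutation(n), n_folds)

    imputed = np.zeros(len(X_unlab))  # fold-averaged predictions on unlabeled data
    residual_sum = 0.0                # sum of held-out errors (prediction - label)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        predict = fit_linear(X_lab[train], y_lab[train])
        imputed += predict(X_unlab) / n_folds
        residual_sum += np.sum(predict(X_lab[held_out]) - y_lab[held_out])

    # Mean of imputed labels, corrected by the average held-out prediction error.
    return imputed.mean() - residual_sum / n
```

Because every labeled point is scored by a model that never saw it, the correction term estimates the prediction bias without setting aside a separate training split, which is the contrast with the data-splitting adaptation of prediction-powered inference mentioned in the abstract.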

Funder

DOD | USN | ONR | Office of Naval Research Global

National Science Foundation

Simons Foundation

DOD | USA | AFC | CCDC | Army Research Office

Publisher

Proceedings of the National Academy of Sciences

Link

https://pnas.org/doi/pdf/10.1073/pnas.2322083121
