DataSAIL: Data Splitting Against Information Leakage

Published: 2023-11-17

Author:

Roman Joeres, David B. Blumenthal, Olga V. Kalinina

Abstract

Information Leakage is an increasing problem in machine learning research. It is a common practice to report models with benchmarks, comparing them to the state-of-the-art performance on the test splits of datasets. If two or more dataset splits contain identical or highly similar samples, a model risks simply memorizing them, and hence, the true performance is overestimated, which is one form of Information Leakage. Depending on the application of the model, the challenge is to find splits that minimize the similarity between data points in any two splits. Frequently, after reducing the similarity between training and test sets, one sees a considerable drop in performance, which is a signal of removed Information Leakage. Recent work has shown that Information Leakage is an emerging problem in model performance assessment.

This work presents DataSAIL, a tool for splitting biological datasets while minimizing Information Leakage in different settings. This is done by splitting the dataset such that the total similarity of any two samples in different splits is minimized. To this end, we formulate data splitting as a Binary Linear Program (BLP) following the rules of Disciplined Quasi-Convex Programming (DQCP) and optimize a solution. DataSAIL can split one-dimensional data, e.g., for property prediction, and two-dimensional data, e.g., data organized as a matrix of binding affinities between two sets of molecules, accounting for similarities along each dimension and missing values. We compute splits of the MoleculeNet benchmarks using DeepChem, the LoHi splitter, GraphPart, and DataSAIL to compare their computational speed and quality. We show that DataSAIL can impose more complex learning tasks on machine learning models and allows for a better assessment of how well the model generalizes beyond the data presented during training.
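The splitting objective described in the abstract (minimize the total similarity between samples assigned to different splits) can be illustrated with a toy sketch. This is not DataSAIL's implementation or API: the function names `cross_split_similarity` and `best_split`, the `min_size` constraint, and the similarity matrix are all hypothetical, and an exhaustive search stands in for the Binary Linear Program solver, so it is only feasible for a handful of samples.

```python
from itertools import product


def cross_split_similarity(sim, assignment):
    """Sum the similarities of all sample pairs placed in different splits."""
    n = len(assignment)
    return sum(
        sim[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if assignment[i] != assignment[j]
    )


def best_split(sim, n_splits=2, min_size=1):
    """Brute-force stand-in for a BLP solver (assumption, toy inputs only).

    Tries every assignment of samples to splits, skips assignments in which
    any split has fewer than `min_size` samples, and keeps the first one
    with the lowest cross-split similarity.
    """
    n = len(sim)
    best, best_cost = None, float("inf")
    for assignment in product(range(n_splits), repeat=n):
        sizes = [assignment.count(s) for s in range(n_splits)]
        if min(sizes) < min_size:
            continue
        cost = cross_split_similarity(sim, assignment)
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost


# Hypothetical symmetric similarity matrix: samples 0-2 resemble each other,
# as do samples 3-4, with little similarity between the two groups.
SIM = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 1.0],
]

assignment, cost = best_split(SIM, n_splits=2, min_size=2)
print(assignment, round(cost, 3))
```

On this toy input the search keeps the two groups of similar samples in separate splits, so near-duplicates never straddle the training/test boundary. A real solver would add constraints such as target split fractions and would scale far beyond what enumeration allows.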

Publisher

Cold Spring Harbor Laboratory
