Fold-stratified cross-validation for unbiased and privacy-preserving federated learning-Reference-Cited by-同舟云学术

Fold-stratified cross-validation for unbiased and privacy-preserving federated learning

Published:2020-07-04 Issue:8 Volume:27 Page:1244-1251
ISSN:1527-974X
Container-title:Journal of the American Medical Informatics Association
language:en
Short-container-title:

Author:

Bey Romain¹,Goussault Romain²,Grolleau François¹,Benchoufi Mehdi¹,Porcher Raphaël¹^ORCID

Affiliation:

1. Centre of Research in Epidemiology and Statistics (CRESS), Université de Paris, French Institute of Health and Medical Research (INSERM), National Institute of Agricultural Research (INRA), Paris, France

2. CIC 1413, Center for Research in Cancerology and Immunology Nantes-Angers (CRCINA), Dermatology Department, Centre Hospitalier Universitaire Nantes, Nantes University, Nantes, France

Abstract

Abstract Objective We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs). Materials and Methods Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records. Results In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome. Discussion Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient’s date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates. Conclusion Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.

Funder

Bpifrance

Publisher

Oxford University Press (OUP)

Subject

Health Informatics

Link

http://academic.oup.com/jamia/article-pdf/27/8/1244/34152945/ocaa096.pdf

Reference55 articles.

1. Dermatologist-level classification of skin cancer with deep neural networks;Esteva;Nature,2017

2. Artificial intelligence in radiology;Hosny;Nat Rev Cancer,2018

3. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care;Komorowski;Nat Med,2018