Autoencoders for sample size estimation for fully connected neural network classifiers

Published: 2022-12-13 Issue: 1 Volume: 5 Page:
ISSN: 2398-6352
Container-title: npj Digital Medicine
Language: en
Short-container-title: npj Digit. Med.

Author:

Gulamali Faris F., Sawant Ashwin S., Kovatch Patricia, Glicksberg Benjamin, Charney Alexander, Nadkarni Girish N., Oermann Eric

Abstract

Sample size estimation is a crucial step in experimental design but is understudied in the context of deep learning. Currently, estimating the quantity of labeled data needed to train a classifier to a desired performance is largely based on prior experience with similar models and problems or on untested heuristics. In many supervised machine learning applications, data labeling can be expensive and time-consuming and would benefit from a more rigorous means of estimating labeling requirements. Here, we study the problem of estimating the minimum sample size of labeled training data necessary for training computer vision models as an exemplar for other deep learning problems. We consider the problem of identifying the minimal number of labeled data points to achieve a generalizable representation of the data, a minimum converging sample (MCS). We use autoencoder loss to estimate the MCS for fully connected neural network classifiers. At sample sizes smaller than the MCS estimate, fully connected networks fail to distinguish classes, and at sample sizes above the MCS estimate, generalizability strongly correlates with the loss function of the autoencoder. We provide an easily accessible, code-free, and dataset-agnostic tool to estimate sample sizes for fully connected networks. Taken together, our findings suggest that MCS and convergence estimation are promising methods to guide sample size estimates for data collection and labeling prior to training deep learning models in computer vision.
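The abstract says only that autoencoder loss is used to estimate the MCS; it does not describe the procedure. The sketch below is therefore an illustration of the general idea, not the authors' method. It assumes a linear autoencoder fitted in closed form (the top-k principal directions, which minimise squared reconstruction error for a linear model), synthetic data, and a hypothetical plateau rule with a tolerance `tol`. All function names and parameters are invented for this example.

```python
import numpy as np

def fit_linear_autoencoder(x_train, k):
    """Closed-form linear autoencoder: top-k principal directions of x_train."""
    mean = x_train.mean(axis=0)
    # Rows of vt are principal directions; the first k serve as tied
    # encoder/decoder weights.
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_loss(x, mean, components):
    """Mean squared reconstruction error of x under the fitted model."""
    codes = (x - mean) @ components.T
    recon = codes @ components + mean
    return float(np.mean((x - recon) ** 2))

def loss_curve(x_pool, x_holdout, sample_sizes, k, rng):
    """Held-out reconstruction loss after fitting on n samples, for each n."""
    losses = []
    for n in sample_sizes:
        idx = rng.choice(len(x_pool), size=n, replace=False)
        mean, comps = fit_linear_autoencoder(x_pool[idx], k)
        losses.append(reconstruction_loss(x_holdout, mean, comps))
    return losses

def estimate_converging_size(sample_sizes, losses, tol=0.05):
    """Hypothetical plateau rule (an assumption, not taken from the paper):
    return the smallest size whose relative improvement to the next size
    is below tol."""
    for i in range(len(losses) - 1):
        if (losses[i] - losses[i + 1]) / losses[i] < tol:
            return sample_sizes[i]
    return sample_sizes[-1]

# Synthetic stand-in data: a 5-dimensional latent signal embedded in 50
# dimensions with small additive noise.
rng = np.random.default_rng(0)
latent_dim, ambient_dim = 5, 50
mixing = rng.normal(size=(latent_dim, ambient_dim))

def sample(n):
    signal = rng.normal(size=(n, latent_dim)) @ mixing
    return signal + 0.1 * rng.normal(size=(n, ambient_dim))

x_pool, x_holdout = sample(2000), sample(500)
sample_sizes = [5, 10, 20, 50, 100, 200, 500, 1000]
losses = loss_curve(x_pool, x_holdout, sample_sizes, k=latent_dim, rng=rng)
mcs_estimate = estimate_converging_size(sample_sizes, losses)

for n, loss in zip(sample_sizes, losses):
    print(f"n={n:5d}  held-out reconstruction loss={loss:.4f}")
print("estimated converging sample size:", mcs_estimate)
```

On real image data one would presumably swap the closed-form fit for a trained nonlinear autoencoder and average over several random subsets per sample size; both choices are left open here because the abstract does not specify them.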

Publisher

Springer Science and Business Media LLC

Subject

Health Information Management,Health Informatics,Computer Science Applications,Medicine (miscellaneous)

Link

https://www.nature.com/articles/s41746-022-00728-0.pdf

Reference: 34 articles.

1. Sambasivan, N. et al. "Everyone wants to do the model work, not the data work": Data cascades in high-stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI '21 (Association for Computing Machinery, New York, NY, USA, 2021).

2. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning, chap. 14 Autoencoders (MIT Press, 2016).

3. Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process. Mag. 29, 141–142 (2012).

4. Cohen, G., Afshar, S., Tapson, J. & Van Schaik, A. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), 2921–2926 (IEEE, 2017).

5. Yadav, C. & Bottou, L. Cold case: The lost MNIST digits.

Cited by 4 articles.

1. Predicting blood–brain barrier permeability of molecules with a large language model and machine learning;Scientific Reports;2024-07-09

2. Bibliography;Reproducibility in Biomedical Research;2024

3. An AI-Guided Data Centric Strategy to Detect and Mitigate Biases in Healthcare Datasets;2023-11-07

4. A methodology to determine the optimal train-set size for autoencoders applied to energy systems;Advanced Engineering Informatics;2023-10