Testing the generalizability and effectiveness of deep learning models among clinics: sperm detection as a pilot study

Published:2024-05-22 Issue:1 Volume:22
ISSN:1477-7827
Container-title:Reproductive Biology and Endocrinology
language:en
Short-container-title:Reprod Biol Endocrinol

Author:

Wang Jiaqi,Jin Yufei,Jiang Aojun,Chen Wenyuan,Shan Guanqiao,Gu Yifan,Ming Yue,Li Jichang,Yue Chunfeng,Huang Zongjie,Librach Clifford,Lin Ge,Wang Xibu,Zhao Huan,Sun Yu,Zhang Zhuoran

Abstract

Background: Deep learning has been increasingly investigated for assisting clinical in vitro fertilization (IVF). The first technical step in many tasks is to visually detect and locate sperm, oocytes, and embryos in images. For clinical deployment of such deep learning models, different clinics use different image acquisition hardware and different sample preprocessing protocols, raising the concern over whether the reported accuracy of a deep learning model by one clinic could be reproduced in another clinic. Here we aim to investigate the effect of each imaging factor on the generalizability of object detection models, using sperm analysis as a pilot example.

Methods: Ablation studies were performed using state-of-the-art models for detecting human sperm to quantitatively assess how model precision (false-positive detection) and recall (missed detection) were affected by imaging magnification, imaging mode, and sample preprocessing protocols. The results led to the hypothesis that the richness of image acquisition conditions in a training dataset deterministically affects model generalizability. The hypothesis was tested by first enriching the training dataset with a wide range of imaging conditions, then validated through internal blind tests on new samples and external multi-center clinical validations.

Results: Ablation experiments revealed that removing subsets of data from the training dataset significantly reduced model precision. Removing raw sample images from the training dataset caused the largest drop in model precision, whereas removing 20x images caused the largest drop in model recall. By incorporating different imaging and sample preprocessing conditions into a rich training dataset, the model achieved an intraclass correlation coefficient (ICC) of 0.97 (95% CI: 0.94-0.99) for precision, and an ICC of 0.97 (95% CI: 0.93-0.99) for recall. Multi-center clinical validation showed no significant differences in model precision or recall across different clinics and applications.

Conclusions: The results validated the hypothesis that the richness of data in the training dataset is a key factor impacting model generalizability. These findings highlight the importance of diversity in a training dataset for model evaluation and suggest that future deep learning models in andrology and reproductive medicine should incorporate comprehensive feature sets for enhanced generalizability across clinics.
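
The abstract ties precision to false-positive detections and recall to missed detections. A minimal sketch of how those two metrics are commonly computed from detection counts is given here for orientation. The function name and the example counts are hypothetical and are not taken from the paper, which does not describe its evaluation code in this record.

```python
def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) from detection counts.

    precision = TP / (TP + FP): falls as false-positive detections rise.
    recall    = TP / (TP + FN): falls as missed detections rise.
    """
    detected = true_positives + false_positives  # everything the model flagged
    actual = true_positives + false_negatives    # everything truly present
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not results from the study).
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={p:.2f} recall={r:.2f}")
```

Under this definition, an ablation that raises false positives lowers precision without necessarily changing recall, and one that raises missed detections does the reverse, which is why the two metrics are reported separately.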

Funder

National Key Research and Development Program of China

National Natural Science Foundation of China

Guangdong Basic and Applied Basic Research Foundation

Shenzhen Science and Technology Innovation Program

Chinese University of Hong Kong, Shenzhen

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s12958-024-01232-8.pdf
