External Validation of an Ensemble Model for Automated Mammography Interpretation by Artificial Intelligence

Authors:

Hsu William,1 Hippe Daniel S.,2 Nakhaei Noor,1 Wang Pin-Chieh,3 Zhu Bing,1 Siu Nathan,4 Ahsen Mehmet Eren,5 Lotter William,6 Sorensen A. Gregory,6 Naeim Arash,7 Buist Diana S. M.,8 Schaffter Thomas,9 Guinney Justin,10 Elmore Joann G.,3 Lee Christoph I.11,12,13

Affiliation:

1. Medical and Imaging Informatics, Department of Radiological Sciences, David Geffen School of Medicine at University of California, Los Angeles

2. Clinical Research Division, Fred Hutchinson Cancer Center, Seattle, Washington

3. Department of Medicine, David Geffen School of Medicine at University of California, Los Angeles

4. Medical Informatics Home Area, Graduate Programs in Biosciences, David Geffen School of Medicine at University of California, Los Angeles

5. Gies College of Business, University of Illinois at Urbana-Champaign

6. DeepHealth, RadNet AI Solutions, Cambridge, Massachusetts

7. Center for Systematic, Measurable, Actionable, Resilient, and Technology-driven Health, Clinical and Translational Science Institute, David Geffen School of Medicine at University of California, Los Angeles

8. Kaiser Permanente Washington Health Research Institute, Seattle, Washington

9. Computational Oncology, Sage Bionetworks, Seattle, Washington

10. Tempus Labs, Chicago, Illinois

11. Department of Radiology, University of Washington School of Medicine, Seattle

12. Department of Health Services, University of Washington School of Public Health, Seattle

13. Hutchinson Institute for Cancer Outcomes Research, Fred Hutchinson Cancer Center, Seattle, Washington

Abstract

Importance

With a shortfall in fellowship-trained breast radiologists, mammography screening programs are looking toward artificial intelligence (AI) to increase efficiency and diagnostic accuracy. External validation studies provide an initial assessment of how promising AI algorithms perform in different practice settings.

Objective

To externally validate an ensemble deep-learning model using data from a high-volume, distributed screening program of an academic health system with a diverse patient population.

Design, Setting, and Participants

In this diagnostic study, an ensemble learning method, which reweights outputs of the 11 highest-performing individual AI models from the Digital Mammography Dialogue on Reverse Engineering Assessment and Methods (DREAM) Mammography Challenge, was used to predict the cancer status of an individual using a standard set of screening mammography images. This study was conducted using retrospective patient data collected between 2010 and 2020 from women aged 40 years and older who underwent a routine breast screening examination and participated in the Athena Breast Health Network at the University of California, Los Angeles (UCLA).

Main Outcomes and Measures

Performance of the challenge ensemble method (CEM) and the CEM combined with radiologist assessment (CEM+R) was compared with diagnosed ductal carcinoma in situ and invasive cancers within a year of the screening examination using performance metrics, such as sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC).

Results

Evaluated on 37 317 examinations from 26 817 women (mean [SD] age, 58.4 [11.5] years), individual model AUROC estimates ranged from 0.77 (95% CI, 0.75-0.79) to 0.83 (95% CI, 0.81-0.85). The CEM model achieved an AUROC of 0.85 (95% CI, 0.84-0.87) in the UCLA cohort, lower than the performance achieved in the Kaiser Permanente Washington (AUROC, 0.90) and Karolinska Institute (AUROC, 0.92) cohorts. The CEM+R model achieved sensitivity (0.813 [95% CI, 0.781-0.843] vs 0.826 [95% CI, 0.795-0.856]; P = .20) and specificity (0.925 [95% CI, 0.916-0.934] vs 0.930 [95% CI, 0.929-0.932]; P = .18) similar to radiologist performance. The CEM+R model had significantly lower sensitivity (0.596 [95% CI, 0.466-0.717] vs 0.850 [95% CI, 0.766-0.923]; P < .001) and specificity (0.803 [95% CI, 0.734-0.861] vs 0.945 [95% CI, 0.936-0.954]; P < .001) than the radiologists in women with a prior history of breast cancer, and lower specificity in Hispanic women (0.894 [95% CI, 0.873-0.910] vs 0.926 [95% CI, 0.919-0.933]; P = .004).

Conclusions and Relevance

This study found that the high performance of an ensemble deep-learning model for automated screening mammography interpretation did not generalize to a more diverse screening cohort, suggesting that the model experienced underspecification. These findings suggest the need for model transparency and fine-tuning of AI models for specific target populations prior to their clinical adoption.

Publisher

American Medical Association (AMA)

Subject

General Medicine
