The Impact of Physician Variation on the Training and Performance of Deep Learning Auto-Segmentation Models: the Development of Physician Inconsistency Metrics

Author:

Yan Yujie1, Kehayias Christopher2, He John1, Aerts Hugo J.W.L.2, Fitzgerald Kelly J.1, Kann Benjamin H.1, Kozono David E.1, Guthier Christian V.1, Mak Raymond H.1

Affiliation:

1. Department of Radiation Oncology, Brigham and Women’s Hospital, Dana-Farber Cancer Institute, Harvard Medical School

2. Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School

Abstract

Manual segmentation of tumors and organs-at-risk (OAR) in 3D imaging for radiation-therapy planning is time-consuming and subject to variation between observers. Artificial intelligence (AI) can assist with segmentation, but ensuring high-quality segmentation remains challenging, especially for small, variable structures. We investigated how variation in physicians' segmentation quality and style affects the training of deep-learning models for esophagus segmentation, and we proposed a new metric, edge roughness, for quantifying slice-to-slice inconsistency. This study includes a real-world cohort of 394 patients who each received radiation therapy (mainly for lung cancer). Segmentation of the esophagus was performed by 8 physicians as part of routine clinical care. To analyze inconsistencies in manual segmentation, we compared the length and edge roughness of segmentations among physicians. We trained six segmentation models in total, some on data from multiple physicians and some on data from an individual physician, based on U-Net architectures and residual backbones. We used the volumetric Dice coefficient to measure the performance of each model. The edge roughness metric quantifies the shift of a segmentation between adjacent slices by calculating the curvature of the edges of its 2D sagittal- and coronal-view projections. The auto-segmentation model trained on multiple physicians (MD1-7) achieved the highest mean Dice of 73.7 ± 14.8%. The individual-physician model with the highest edge roughness (MD7; mean ± SD: 0.106 ± 0.016) demonstrated significantly lower volumetric Dice on test cases than the other individual models (MD7: 58.5 ± 15.8%, MD6: 67.1 ± 16.8%, p < 0.001). An additional multiple-physician model trained after removing the MD7 data produced fewer outliers (e.g., Dice ≤ 40%: 4 cases for MD1-6, 7 cases for MD1-7, Ntotal = 394).
This study demonstrates that there is significant variation in the style and quality of manual segmentations in clinical care, and that training AI auto-segmentation algorithms on real-world clinical datasets may yield unexpectedly under-performing algorithms when outliers are included. Importantly, this study provides a novel evaluation metric, edge roughness, to quantify physician variation in segmentation, which will allow developers to filter clinical training data to optimize model performance.
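
As a rough illustration of the two quantities named in the abstract, the Python sketch below computes a volumetric Dice coefficient and a simple edge-roughness proxy on binary 3D masks. The Dice function follows the standard definition. The roughness function is only an assumption: the abstract states that the metric is based on the curvature of the edges of the 2D sagittal- and coronal-view projections but does not give a formula, so the mean absolute second difference of the edge positions across slices is used here as a stand-in. The function names, the (z, y, x) axis order, and the toy volumes are hypothetical and are not taken from the study.

```python
import numpy as np


def volumetric_dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Volumetric Dice overlap of two binary 3D masks, as a fraction in [0, 1]."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)


def edge_roughness(mask: np.ndarray, project_axis: int) -> float:
    """Illustrative roughness proxy for one 2D projection of a (z, y, x) mask.

    ASSUMPTION: the second-difference formula below is a stand-in for the
    curvature calculation, which the abstract does not specify.
    project_axis=2 collapses x (sagittal view); project_axis=1 collapses y
    (coronal view). In both cases the rows of the projection are axial slices.
    """
    proj = mask.astype(bool).any(axis=project_axis)  # 2D silhouette
    rows = np.flatnonzero(proj.any(axis=1))          # slices containing the structure
    if rows.size < 3:
        return 0.0  # a second difference needs at least three slices
    cols = np.arange(proj.shape[1])
    # Position of the two edges of the silhouette on each occupied slice.
    # Simplification: occupied slices are treated as consecutive.
    lo = np.array([cols[proj[r]].min() for r in rows], dtype=float)
    hi = np.array([cols[proj[r]].max() for r in rows], dtype=float)
    # Discrete curvature proxy: absolute second difference along the slice axis.
    curvature = np.concatenate([np.abs(np.diff(lo, n=2)), np.abs(np.diff(hi, n=2))])
    return float(curvature.mean())


def make_tube(shifts):
    """Toy 'esophagus': a 4x4 square per slice, shifted in x by shifts[z] voxels."""
    vol = np.zeros((len(shifts), 16, 16), dtype=bool)
    for z, s in enumerate(shifts):
        vol[z, 6:10, 6 + s:10 + s] = True
    return vol


smooth = make_tube([0] * 8)     # consistent from slice to slice
jagged = make_tube([0, 2] * 4)  # jumps 2 voxels sideways on every other slice

print(volumetric_dice(smooth, jagged))          # 0.75
print(edge_roughness(smooth, project_axis=1))   # 0.0 (coronal view)
print(edge_roughness(jagged, project_axis=1))   # 4.0 (coronal view)
print(edge_roughness(jagged, project_axis=2))   # 0.0 (sagittal view, no shift in y)
```

In this toy case the jagged contour still overlaps the smooth one substantially, yet its coronal roughness is clearly nonzero. That is the kind of slice-to-slice inconsistency an overlap score alone does not isolate, and a threshold on such a score is one conceivable way to flag training contours for review.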

Publisher

Research Square Platform LLC

