Development and Validation of a Natural Language Processing Algorithm for Extracting Clinical and Pathological Features of Breast Cancer From Pathology Reports

Author:

Munzone Elisabetta (1), Marra Antonio (2), Comotto Federico (3), Guercio Lorenzo (3), Sangalli Claudia Anna (4), Lo Cascio Martina (5), Pagan Eleonora (6), Sangalli Davide (5), Bigoni Ilaria (3), Porta Francesca Maria (7), D'Ercole Marianna (7), Ritorti Fabiana (3), Bagnardi Vincenzo (6), Fusco Nicola (7, 8), Curigliano Giuseppe (2, 8)

Affiliation:

1. Division of Medical Senology, European Institute of Oncology IRCCS, Milan, Italy

2. Division of Early Drug Development for Innovative Therapies, European Institute of Oncology IRCCS, Milan, Italy

3. Reply S.p.A., Turin, Italy

4. Clinical Trial Office, European Institute of Oncology IRCCS, Milan, Italy

5. Central Management of Information Systems and Technologies, European Institute of Oncology IRCCS, Milan, Italy

6. Department of Statistics and Quantitative Methods, University of Milan-Bicocca, Milan, Italy

7. Division of Pathology, European Institute of Oncology IRCCS, Milan, Italy

8. Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy

Abstract

PURPOSE Electronic health records (EHRs) are valuable repositories of real-world data for clinical research on breast cancer (BC). The objective of this study was to develop a natural language processing (NLP) model to extract structured data from BC pathology reports written in natural language.

METHODS In the first phase, the development cohort comprised 193 pathology reports from 116 patients with BC from 2012 to 2016. A rule-based NLP algorithm was applied to extract 26 variables, and its output was compared with manual data extraction performed by a data entry specialist and by an oncologist. In the second phase, the data set was expanded to 513 reports, and a Named Entity Recognition (NER)-NLP model was trained and evaluated using K-fold cross-validation.

RESULTS In the first phase, concordance analysis showed 82.9% agreement between the algorithm and the oncologist, whereas agreement between the data entry specialist and the oncologist was 90.8%. In the second phase, the NER-NLP model reached an accuracy of 97.8%. Performance was especially strong for estrogen receptor, progesterone receptor, human epidermal growth factor receptor 2, and Ki-67 (F1-score 1.0).

CONCLUSION The present study aligns with the rapidly evolving field of artificial intelligence (AI) applications in oncology, seeking to expedite the development of complex cancer databases and registries. The model's output is currently undergoing postprocessing to organize the data into tabular structures, facilitating its use in real-world clinical and research endeavors.
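To make the evaluation terms in the abstract concrete, the following is a minimal sketch of K-fold splitting and per-variable F1 scoring with exact-match comparison. It is not the authors' implementation: the abstract does not state the value of K, the splitting unit, or the matching criterion, so the choice of 5 folds, the per-report split, the function names `k_fold_splits` and `f1_per_label`, and the toy annotations are all assumptions made for illustration.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each item lands in exactly one test fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def f1_per_label(gold, pred):
    """Exact-match F1 per label.

    gold and pred are sets of (report_id, label, value) tuples.
    """
    labels = {lab for _, lab, _ in gold | pred}
    scores = {}
    for lab in labels:
        g = {e for e in gold if e[1] == lab}
        p = {e for e in pred if e[1] == lab}
        tp = len(g & p)  # entities extracted with the correct value
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        scores[lab] = (
            2 * precision * recall / (precision + recall) if tp else 0.0
        )
    return scores


# Hypothetical toy annotations, not data from the study.
gold = {(1, "ER", "95%"), (1, "Ki-67", "20%"), (2, "ER", "0%")}
pred = {(1, "ER", "95%"), (1, "Ki-67", "25%"), (2, "ER", "0%")}
scores = f1_per_label(gold, pred)

# 513 is the report count from the abstract; k=5 is an assumed value.
splits = list(k_fold_splits(513, k=5))
```

In this toy example the ER values match exactly and the Ki-67 value does not, so the two labels score 1.0 and 0.0 respectively. Because a patient can have more than one report, a per-patient split would be a reasonable alternative to the per-report split assumed here.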

Publisher

American Society of Clinical Oncology (ASCO)
