Affiliation:
1. Samsung Advanced Institute for Health Sciences and Technology, Sung Kyun Kwan University
2. National Cancer Center
Abstract
Background
Pathology reports provide important information for the accurate diagnosis of cancer and for optimal treatment decisions. Breast cancer in particular is known to be the most common cancer in women worldwide.
Objective
We compared the accuracy of regular expressions and natural language processing (NLP) for extracting data from breast cancer pathology reports at a single institution.
Methods
A total of 1,215 breast cancer pathology reports were annotated for NLP model development. As NLP models, we considered three BERT models with specific vocabularies: BERT-basic, BioBERT, and ClinicalBERT. K-fold cross-validation was used to verify the performance of the BERT models. The outputs of the regular expression method and the BERT model were compared using the fuzzywuzzy algorithm.
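The k-fold cross-validation step can be illustrated with a minimal sketch. It assumes only the report count (1,215) and k = 5 stated in this abstract; the shuffling seed and the function name `k_fold_indices` are hypothetical, and this is not the authors' actual pipeline.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    # Shuffle once with a fixed (assumed) seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 1,215 annotated reports split into 5 folds, as described in Methods.
folds = list(k_fold_indices(1215, k=5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={len(train_idx)}, test={len(test_idx)}")
```

Each report appears in exactly one test fold, so every annotated report contributes once to the evaluation and k - 1 times to training.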
Results
Among the three BERT models, BioBERT was the most accurate parsing model for breast cancer pathology reports (average performance = 0.99901) when k was set to 5. BioBERT also had the lowest error rate for all items in the breast cancer pathology report compared with the other BERT models (accuracy for all variables ≥ 0.9). We therefore selected BioBERT as the NLP model. When the results of BioBERT and regular expressions were compared using the fuzzywuzzy algorithm, BioBERT was more accurate than the regular expression method, especially for items such as intraductal_comp, lymph node, and lymphovascular invasion.
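A fuzzy comparison between an extracted value and its annotated reference can be sketched as follows. This is an illustration under stated assumptions, not the study's code: it substitutes the standard library's `difflib.SequenceMatcher` for fuzzywuzzy (which offers comparable scores such as `fuzz.ratio`), and the report excerpt, regular expression, and reference annotation are all hypothetical.

```python
import re
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score, ignoring case and outer whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    return round(100 * SequenceMatcher(None, a, b).ratio())

# Hypothetical report excerpt and field pattern, for illustration only.
report = "Lymphovascular invasion: not identified\nLymph node: 0/12"
pattern = re.compile(r"Lymphovascular invasion:\s*(.+)", re.IGNORECASE)

# Regular-expression extraction: capture the rest of the matching line.
match = pattern.search(report)
regex_value = match.group(1) if match else ""

# Hypothetical reference annotation for the same field.
annotated_value = "Not identified"

score = similarity(regex_value, annotated_value)
print(f"extracted={regex_value!r}, score={score}")
```

A score near 100 indicates that the extracted string agrees with the reference up to minor differences in case or spacing; lower scores flag fields where the extraction method diverges from the annotation.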
Conclusions
Our results showed that the NLP model BioBERT had higher accuracy than regular expressions, suggesting the value of BioBERT for processing breast cancer pathology reports.
Publisher
Research Square Platform LLC