Abstract
In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances make the need for a benchmark dataset urgent. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using AI allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to publicly available report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. We perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.
Publisher
Cold Spring Harbor Laboratory