
Concept-Match Medical Data Scrubbing

Published:2003-06-01 Issue:6 Volume:127 Page:680-686
ISSN:1543-2165
Container-title:Archives of Pathology & Laboratory Medicine
language:en

Author:

Berman Jules J.¹

Affiliation:

1. From the Cancer Diagnosis Program, National Cancer Institute, Rockville, Md

Abstract

Context: In the normal course of activity, pathologists create and archive immense data sets of scientifically valuable information. Researchers need pathology-based data sets, annotated with clinical information and linked to archived tissues, to discover and validate new diagnostic tests and therapies. Pathology records can be used for research purposes (without obtaining informed patient consent for each use of each record), provided the data are rendered harmless. Large data sets can be made harmless through 3 computational steps: (1) deidentification, the removal or modification of data fields that can be used to identify a patient (name, social security number, etc); (2) rendering the data ambiguous, ensuring that every data record in a public data set has a nonunique set of characterizing data; and (3) data scrubbing, the removal or transformation of words in free text that can be used to identify persons or that contain information that is incriminating or otherwise private. This article addresses the problem of data scrubbing.

Objective: To design and implement a general algorithm that scrubs pathology free text, removing all identifying or private information.

Methods: The Concept-Match algorithm steps through confidential text. When a medical term matching a standard nomenclature term is encountered, the term is replaced by a nomenclature code and a synonym for the original term. When a high-frequency “stop” word, such as a, an, the, or for, is encountered, it is left in place. When any other word is encountered, it is blocked and replaced by asterisks. This produces a scrubbed text. An open-source implementation of the algorithm is freely available.

Results: The Concept-Match scrub method transformed pathology free text into scrubbed output that preserved the sense of the original sentences, while it blocked terms that did not match terms found in the Unified Medical Language System (UMLS). The scrubbed product is safe, in the restricted sense that the output retains only standard medical terms. The software implementation scrubbed more than half a million surgical pathology report phrases in less than an hour.

Conclusions: Computerized scrubbing can render the textual portion of a pathology report harmless for research purposes. Scrubbing and deidentification methods allow pathologists to create and use large pathology databases to conduct medical research.
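
The three rules described under Methods can be illustrated with a minimal Python sketch. This is not the author's open-source implementation; it only follows the abstract's description. The `NOMENCLATURE` table, its codes, the `STOP_WORDS` list, the longest-match-first strategy, and the single `*` used as the block marker are all assumptions made for illustration. A real system would load terms and codes from a nomenclature such as UMLS.

```python
import re

# Hypothetical stand-in for a standard nomenclature: term -> (code, synonym).
# The codes are made-up placeholders, NOT real UMLS identifiers.
NOMENCLATURE = {
    "adenocarcinoma": ("C0000001", "adenocarcinoma"),
    "colon": ("C0000002", "large intestine"),
    "renal cell carcinoma": ("C0000003", "kidney cell carcinoma"),
}

# Assumed short list of high-frequency stop words that are left in place.
STOP_WORDS = {"a", "an", "the", "for", "of", "in", "with", "and"}

# Longest nomenclature term, in words; bounds the phrase window.
MAX_TERM_WORDS = max(len(term.split()) for term in NOMENCLATURE)


def scrub(text):
    """Return a scrubbed copy of text following the three rules above."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    out = []
    i = 0
    while i < len(words):
        # Rule 1: try the longest phrase starting at position i first.
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in NOMENCLATURE:
                code, synonym = NOMENCLATURE[phrase]
                out.append(f"{synonym} {code}")
                i += n
                break
        else:
            # Rule 2: keep stop words. Rule 3: block everything else.
            out.append(words[i] if words[i] in STOP_WORDS else "*")
            i += 1
    return " ".join(out)


print(scrub("Mr. Smith has an adenocarcinoma of the colon"))
# prints: * * * an adenocarcinoma C0000001 of the large intestine C0000002
```

Because only nomenclature terms and stop words survive, a name such as "Smith" is blocked without the scrubber needing any list of names, which matches the restricted sense of safety stated under Results.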

Publisher

Archives of Pathology and Laboratory Medicine

Subject

Medical Laboratory Technology,General Medicine,Pathology and Forensic Medicine

Link

http://meridian.allenpress.com/aplm/article-pdf/127/6/680/2730848/1543-2165(2003)127_680_cmds_2_0_co_2.pdf

Cited by 30 articles.

1. Harnessing Moderate-Sized Language Models for Reliable Patient Data De-identification in Emergency Department Records: An Evaluation of Strategies and Performance (Preprint);2024-02-28

2. Formal verification and complexity analysis of confidentiality aware textual clinical documents framework;International Journal of Intelligent Systems;2021-06-11

3. Automated anonymization of text documents in Polish;Procedia Computer Science;2021

4. An Empirical Study of Applying Statistical Disclosure Control Methods to Public Health Research;International Journal of Environmental Research and Public Health;2019-11-15

5. Framework Design and Case Study for Privacy-Preserving Medical Data Publishing;International Journal of E-Health and Medical Communications;2013-10