Affiliation:
1. Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of Moscow Health Care Department
2. Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
3. Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of Moscow Health Care Department; Research Center of Neurology
Abstract
The aim of the study was to develop a method for detecting text areas that contain private data on medical diagnostic images, using the Tesseract module and a modified Levenshtein distance.

Materials and methods. At the initial stage, the brightness of the pixels belonging to text characters in the image is determined for threshold filtering. The dynamic threshold is calculated from the histogram of pixel brightness. Next, the Tesseract module is used for primary text recognition. A set of strings is formed from the tag values of the DICOM files, and these strings are searched for in the recognized text using a modified Levenshtein distance. A set of DICOM files of the “Dose Report” type was used to test the algorithm. Accuracy was assessed against expert markup of the blocks of private information on the images.

Results. A tool has been developed, together with a set of metrics and optimal thresholds for the decision rules used in match finding, that detects areas of text with private data on medical images. On a set of 1131 medical images, the accuracy of localizing areas with personal data was 99.86% in comparison with expert markup.

Conclusion. The tool developed in this study identifies personal data on digital medical images with high accuracy, which indicates that it can be applied in practice to the preparation of data sets.
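The string search described under Materials and methods can be illustrated with a minimal sketch. The abstract does not specify how the Levenshtein distance was modified, so the code below uses a standard approximate substring variant (the match may start anywhere in the recognized text) as a stand-in. The function names `fuzzy_find` and `contains_private`, and the 0.25 threshold, are assumptions made for illustration and are not taken from the study.

```python
def fuzzy_find(pattern, text):
    """Return (distance, end) for the best approximate match of pattern in text.

    Standard Levenshtein costs, but the match may start at any position
    in text. This is NOT necessarily the authors' modification.
    """
    m = len(pattern)
    # prev[i]: distance between pattern[:i] and the best substring of
    # text ending at the previous position.
    prev = list(range(m + 1))
    best_dist, best_end = prev[m], 0
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing in the text
            )
        if cur[m] < best_dist:
            best_dist, best_end = cur[m], j
        prev = cur
    return best_dist, best_end


def contains_private(tag_value, ocr_text, max_ratio=0.25):
    """Hypothetical decision rule: distance normalized by pattern length."""
    if not tag_value:
        return False
    dist, _ = fuzzy_find(tag_value.casefold(), ocr_text.casefold())
    return dist / len(tag_value) <= max_ratio
```

For example, `contains_private("Ivanov", "Patient: 1VANOV I.I.")` tolerates the single OCR substitution of `1` for `I`, while text without a near match is rejected. In practice the threshold would need tuning against expert markup, as the abstract describes.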
Subject
Radiology, Nuclear Medicine and Imaging; Radiological and Ultrasound Technology