BACKGROUND
Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the self-attention-based Transformer architecture with multi-modal pre-training objectives. Despite its huge potential, vision-language multi-modal pre-training in the medical domain has only recently received attention, and existing work has demonstrated only improved diagnosis accuracy for vision-language pre-trained models.
OBJECTIVE
In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports.
METHODS
We propose the Medical Vision Language Learner (MedViLL), which adopts a Transformer-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (disease classification, medical image-report retrieval, and medical visual question answering) and a vision-language generation task (radiology report generation).
RESULTS
By rigorously evaluating the proposed model on four downstream tasks with three radiographic image-text datasets (MIMIC-CXR, Open-I, and VQA-RAD), we empirically demonstrate the superior downstream task performance and generality of MedViLL against various baselines, including task-specific architectures. In addition, we qualitatively analyze MedViLL by presenting retrieved image-report pairs, attention map visualizations, and generated reports.
CONCLUSIONS
Our proposed multi-modal pre-training model, MedViLL, can flexibly adapt to multiple downstream tasks of vision-language understanding and generation with a novel self-attention scheme. We believe that our approach can provide the basis for a wide range of interpretations of vision-language multi-modal data in the medical domain.