Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training: Development of Deep Learning Algorithm Study (Preprint)

Authors:

Moon Jong Hak, Lee Hyungyung, Shin Woncheol, Kim Young-Hak, Choi Edward

Abstract

BACKGROUND

Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the self-attention-based Transformer architecture with multi-modal pre-training objectives. Despite its huge potential, vision-language multi-modal pre-training in the medical domain has only recently received attention, and prior work has only demonstrated improved diagnosis accuracy of vision-language pre-trained models.

OBJECTIVE

In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports.

METHODS

We propose the Medical Vision Language Learner (MedViLL), which adopts a Transformer-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (disease classification, medical image-report retrieval, and medical visual question answering) and a vision-language generation task (radiology report generation).
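As a rough illustration of what a multi-modal attention mask can look like, the sketch below builds a 0/1 mask over a sequence of image tokens followed by report tokens. The abstract does not spell out the masking scheme, so the layout here is an assumption: image tokens attend only to image tokens, while each text token attends to all image tokens and to text tokens up to and including itself. The function name `build_seq2seq_mask` and its parameters are hypothetical and are not taken from the MedViLL code.

```python
def build_seq2seq_mask(num_image_tokens: int, num_text_tokens: int) -> list[list[int]]:
    """Return an n x n mask where mask[i][j] == 1 means token i may attend to token j.

    Assumed layout (not confirmed by the abstract): image tokens come first,
    followed by text tokens.
    """
    n = num_image_tokens + num_text_tokens
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < num_image_tokens:
                # Every token, image or text, may attend to every image token.
                mask[i][j] = 1
            elif i >= num_image_tokens and j <= i:
                # A text token may attend to itself and to earlier text tokens only.
                mask[i][j] = 1
    return mask


if __name__ == "__main__":
    # Two image tokens followed by three text tokens.
    for row in build_seq2seq_mask(2, 3):
        print(row)
```

Under this assumed layout the text-to-text block is lower triangular, which is the property a generation task such as report writing needs, while the image columns stay fully visible so that text can be conditioned on the whole image.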

RESULTS

By rigorously evaluating the proposed model on four downstream tasks with three radiographic image-text datasets (MIMIC-CXR, Open-I, and VQA-RAD), we empirically demonstrate the superior downstream task performance and generality of MedViLL against various baselines, including task-specific architectures. In addition, we qualitatively analyze MedViLL by showing retrieved image-report pairs, attention map visualizations, and generated reports.

CONCLUSIONS

Our proposed multi-modal pre-training model, MedViLL, can flexibly adapt to multiple downstream tasks of vision-language understanding and generation with a novel self-attention scheme. We believe that our approach can provide the basis for a wide range of interpretations of vision-language multi-modal data in the medical domain.

Publisher

JMIR Publications Inc.

Cited by 7 articles.
