VLIB: Unveiling insights through Visual and Linguistic Integration of Biorxiv data relevant to cancer via Multimodal Large Language Model
Authors:
Vignesh Prabhakar, Kai Liu
Abstract
The field of cancer research has benefited greatly from the wealth of new knowledge provided by research articles and preprints on platforms such as bioRxiv. This study investigates the role of scientific figures and their accompanying captions in enhancing our comprehension of cancer. Leveraging the capabilities of Multimodal Large Language Models (MLLMs), we conduct a comprehensive analysis of both visual and linguistic data in the biomedical literature. Our work introduces VLIB, a large scientific figure-caption dataset generated from cancer biology papers on bioRxiv. After thorough preprocessing, which includes figure-caption pair extraction, sub-figure identification, and text normalization, VLIB comprises over 500,000 figures from more than 70,000 papers, each figure paired with its relevant caption. We fine-tune baseline MLLMs on VLIB for downstream vision-language tasks, namely image captioning and visual question answering (VQA), to assess their performance. Our experimental results underscore the vital role that scientific figures, including molecular structures, histological images, and data visualizations, together with their captions, play in facilitating knowledge translation through MLLMs. Specifically, we achieved a ROUGE score of 0.66 for VQA and 0.68 for image captioning, and a BLEU score of 0.72 for VQA and 0.70 for image captioning. Furthermore, our investigation highlights the potential of MLLMs to bridge the gap between artificial intelligence and domain experts in cancer biology.
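As an illustration of the evaluation step summarized above, the minimal sketch below scores model outputs against reference texts using ROUGE-L and BLEU. This is not the authors' code: the choice of the rouge-score and nltk packages, and the sample prediction/reference strings, are assumptions made for the example.

```python
# Minimal sketch (not the authors' implementation) of scoring generated
# captions or VQA answers with ROUGE-L and BLEU, the metrics reported
# in the abstract. Requires the third-party packages `rouge-score` and
# `nltk`; the prediction/reference strings below are hypothetical.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

predictions = ["immunofluorescence staining of tumor cells"]
references = ["immunofluorescence staining of breast tumor cells"]

# ROUGE-L F-measure, with stemming to match morphological variants.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
# Smoothing avoids zero BLEU scores on short captions with missing n-grams.
smooth = SmoothingFunction().method1

for pred, ref in zip(predictions, references):
    rouge_l = scorer.score(ref, pred)["rougeL"].fmeasure
    bleu = sentence_bleu([ref.split()], pred.split(), smoothing_function=smooth)
    print(f"ROUGE-L: {rouge_l:.2f}  BLEU: {bleu:.2f}")
```

Corpus-level figures like those in the abstract would then be obtained by averaging these per-example scores over the held-out test split.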
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.