Building Flexible, Scalable, and Machine Learning-Ready Multimodal Oncology Datasets

Authors:

Aakash Tripathi (1,2), Asim Waqas (1,2), Kavya Venkatesan (1), Yasin Yilmaz (2), Ghulam Rasool (1,2,3,4)

Affiliations:

1. Department of Machine Learning, Moffitt Cancer Center & Research Institute, Tampa, FL 33612, USA

2. Department of Electrical Engineering, University of South Florida, Tampa, FL 33620, USA

3. Department of Neuro-Oncology, Moffitt Cancer Center & Research Institute, Tampa, FL 33612, USA

4. Department of Oncologic Sciences, University of South Florida, Tampa, FL 33612, USA

Abstract

Advances in data acquisition, storage, and processing have driven rapid growth in heterogeneous medical data. Integrating radiological scans, histopathology images, and molecular information with clinical data is essential for understanding a disease holistically and optimizing its treatment. The need to integrate data from multiple sources is especially pronounced in complex diseases such as cancer, where it enables precision medicine and personalized treatment. This work proposes the Multimodal Integration of Oncology Data System (MINDS), a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources such as the Cancer Research Data Commons (CRDC) into an interconnected, patient-centric framework. MINDS consolidates over 41,000 cases from across repositories while achieving a high compression ratio relative to the 3.78 PB source data size. It offers sub-5-s query response times for interactive exploration, along with an interface for exploring relationships across data types and building cohorts for developing large-scale multimodal machine learning models. By harmonizing multimodal data, MINDS aims to give researchers greater analytical ability to uncover diagnostic and prognostic insights and to enable evidence-based personalized care. MINDS tracks granular end-to-end data provenance, ensuring reproducibility and transparency. Its cloud-native architecture can handle exponential data growth in a secure, cost-optimized manner while providing substantial storage optimization, replication avoidance, and dynamic access capabilities. Auto-scaling, access controls, and other mechanisms ensure the scalability and security of its pipelines. MINDS overcomes the limitations of existing biomedical data silos through an interoperable, metadata-driven approach that represents a pivotal step toward the future of oncology data integration.
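To make the idea of a patient-centric metadata framework concrete, the sketch below shows one way a cohort query over linked metadata could look. It is only an illustration: the table names (`clinical`, `radiology`, `pathology`), the column names, the placeholder URIs, and the helper functions `make_demo_db` and `build_cohort` are all hypothetical and are not taken from MINDS. It uses an in-memory SQLite database purely for convenience. The store holds references to images rather than the images themselves, which is the general reason a metadata layer can be far smaller than the source data it indexes.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory metadata store (hypothetical schema, toy rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE clinical  (case_id TEXT PRIMARY KEY, primary_site TEXT);
        CREATE TABLE radiology (case_id TEXT, series_uri TEXT);
        CREATE TABLE pathology (case_id TEXT, slide_uri TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO clinical VALUES (?, ?)",
        [("case-001", "Lung"), ("case-002", "Lung"), ("case-003", "Breast")],
    )
    # Only pointers to the raw files are stored, never the files themselves.
    conn.executemany(
        "INSERT INTO radiology VALUES (?, ?)",
        [
            ("case-001", "placeholder://ct/1"),
            ("case-002", "placeholder://ct/2"),
            ("case-003", "placeholder://ct/3"),
        ],
    )
    conn.executemany(
        "INSERT INTO pathology VALUES (?, ?)",
        [("case-001", "placeholder://wsi/1"), ("case-003", "placeholder://wsi/3")],
    )
    conn.commit()
    return conn


def build_cohort(conn, primary_site):
    """Return sorted case IDs at `primary_site` that have both imaging modalities."""
    query = """
        SELECT c.case_id
        FROM clinical AS c
        WHERE c.primary_site = ?
          AND EXISTS (SELECT 1 FROM radiology AS r WHERE r.case_id = c.case_id)
          AND EXISTS (SELECT 1 FROM pathology AS p WHERE p.case_id = c.case_id)
        ORDER BY c.case_id
    """
    return [row[0] for row in conn.execute(query, (primary_site,))]


if __name__ == "__main__":
    demo = make_demo_db()
    print(build_cohort(demo, "Lung"))
```

In this toy setup, a shared case identifier links the modalities, so a multimodal cohort reduces to a filter on the clinical table plus existence checks on the per-modality tables. A downstream training pipeline would then fetch only the files referenced by the selected cases.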

Funder

National Science Foundation

Publisher

MDPI AG


Cited by 4 articles.
