Abstract
Objectives
MAchine Learning In MyelomA Response (MALIMAR) is an observational clinical study combining “real-world” and clinical trial data, both retrospective and prospective. Images were acquired on three MRI scanners over a 10-year window at two institutions, leading to a need for extensive curation.
Methods
Curation involved image aggregation, pseudonymisation, allocation between project phases, data cleaning, upload to an XNAT repository visible from multiple sites, annotation, incorporation of machine learning research outputs and quality assurance using programmatic methods.
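As an illustration of what quality assurance using programmatic methods might look like, the following is a minimal sketch and not the study's actual software. It assumes session metadata has already been exported from the repository into plain Python objects; the `Session` class, the `qa_report` function and the series names in `REQUIRED_SERIES` are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field

# Hypothetical series names: the study's actual scan protocol is not given here.
REQUIRED_SERIES = {"T1W_DIXON", "DWI", "ADC"}


@dataclass
class Session:
    """Minimal stand-in for one imaging session's metadata (assumed structure)."""
    session_id: str
    subject_id: str
    series: set = field(default_factory=set)


def qa_report(sessions):
    """Return {session_id: [issue, ...]} for every session that fails a check."""
    issues = {}
    seen_ids = set()
    for s in sessions:
        problems = []
        # Check 1: required series must all be present.
        missing = REQUIRED_SERIES - s.series
        if missing:
            problems.append("missing series: " + ", ".join(sorted(missing)))
        # Check 2: session identifiers must be unique across the dataset.
        if s.session_id in seen_ids:
            problems.append("duplicate session id")
        seen_ids.add(s.session_id)
        if problems:
            issues.setdefault(s.session_id, []).extend(problems)
    return issues


if __name__ == "__main__":
    demo = [
        Session("S1", "P1", {"T1W_DIXON", "DWI", "ADC"}),
        Session("S2", "P1", {"DWI"}),
    ]
    for session_id, problems in qa_report(demo).items():
        print(session_id, "->", "; ".join(problems))
```

Running checks of this kind over the whole repository, rather than inspecting sessions by hand, is what allows incomplete or inconsistent datasets to be flagged systematically before they reach the machine learning stage.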
Results
A total of 796 whole-body MR imaging sessions from 462 subjects were curated. A major change in scan protocol part way through the retrospective window meant that approximately 30% of available imaging sessions had properties that differed significantly from the remainder of the data. Issues were found with a vendor-supplied clinical algorithm for “composing” whole-body images from multiple imaging stations. Historic weaknesses in a digital video disk (DVD) research archive (already addressed by the mid-2010s) were highlighted by incomplete datasets, some of which could not be completely recovered. The final dataset contained 736 imaging sessions for 432 subjects. Software was written to clean and harmonise data. Implications for the subsequent machine learning activity are considered.
Conclusions
MALIMAR exemplifies the vital role that curation plays in machine learning studies that use real-world data. A research repository such as XNAT facilitates day-to-day management, ensures robustness and consistency and enhances the value of the final dataset. The types of process described here will be vital for future large-scale multi-institutional and multi-national imaging projects.
Critical relevance statement
This article showcases innovative data curation methods using a state-of-the-art image repository platform; such tools will be vital for managing the large multi-institutional datasets required to train and validate generalisable ML algorithms and future foundation models in medical imaging.
Key points
• Heterogeneous data in the MALIMAR study required the development of novel curation strategies.
• Correction of multiple problems affecting the real-world data was successful, but implications for machine learning are still being evaluated.
• Modern image repositories have rich application programming interfaces enabling data enrichment and programmatic QA, making them much more than simple “image marts”.
Funder
National Institute for Health and Care Research
NIHR Biomedical Research Centre, Royal Marsden NHS Foundation Trust/Institute of Cancer Research
Cancer Research UK
Publisher
Springer Science and Business Media LLC