Towards efficient AutoML: a pipeline synthesis approach leveraging pre-trained transformers for multimodal data
Published: 2024-07-19
Issue: 9
Volume: 113
Pages: 7011-7053
ISSN: 0885-6125
Container title: Machine Learning
Language: en
Short container title: Mach Learn
Authors: Ambarish Moharil, Joaquin Vanschoren, Prabhant Singh, Damian Tamburri
Abstract
This paper introduces an Automated Machine Learning (AutoML) framework designed to efficiently synthesize end-to-end multimodal machine learning pipelines. It minimizes reliance on computationally demanding Neural Architecture Search by strategically integrating pre-trained transformer models, which unify diverse data modalities into high-dimensional embeddings and thereby streamline pipeline development. A Bayesian Optimization strategy informed by meta-learning warm-starts the pipeline synthesis, further improving computational efficiency. The methodology can produce advanced, customized multimodal pipelines within a limited computational budget. Extensive testing across 23 varied multimodal datasets demonstrates the promise and utility of the framework in diverse scenarios. The results contribute to ongoing efforts in AutoML and suggest new possibilities for efficiently handling complex multimodal data, representing a step towards more efficient and versatile tools for multimodal machine learning pipeline development.
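The core idea in the abstract is to use pre-trained transformers to map each modality into a shared embedding space, so that pipeline search operates on fixed-length vectors rather than raw multimodal inputs. Below is a minimal sketch of that general idea, assuming the Hugging Face transformers library with BERT as the text encoder, ViT as the image encoder, and simple concatenation as the fusion step; all of these are illustrative assumptions, not the authors' exact configuration.

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor

# Text encoder (BERT) and image encoder (ViT); both are illustrative
# checkpoint choices, not necessarily those used in the paper.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image_encoder = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k")

def embed_text(sentence: str) -> torch.Tensor:
    """Map a sentence to a fixed-length vector by mean-pooling BERT tokens."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = text_encoder(**inputs)
    return out.last_hidden_state.mean(dim=1).squeeze(0)  # shape: (768,)

def embed_image(image: Image.Image) -> torch.Tensor:
    """Map an image to a fixed-length vector via the ViT [CLS] embedding."""
    inputs = image_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = image_encoder(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0)  # shape: (768,)

# Fuse per-modality embeddings into one feature vector that a conventional
# (tabular) AutoML pipeline can consume; concatenation is one simple choice.
text_vec = embed_text("a dog catching a frisbee in the park")
image_vec = embed_image(Image.new("RGB", (224, 224)))  # placeholder image
fused = torch.cat([text_vec, image_vec], dim=0)  # shape: (1536,)
```

Once every modality lives in a single vector space, downstream pipeline synthesis reduces to a more standard (tabular) AutoML problem. The abstract's second ingredient, meta-learned warm-starting of Bayesian Optimization, can likewise be sketched by seeding the optimizer with configurations that performed well on similar datasets instead of random initial points; the objective function and seed configurations below are hypothetical placeholders, shown here with scikit-optimize rather than the authors' own optimizer.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Hypothetical two-dimensional pipeline search space.
search_space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
    Integer(16, 256, name="hidden_units"),
]

def pipeline_validation_loss(params):
    """Stand-in for training and evaluating one candidate pipeline."""
    learning_rate, hidden_units = params
    return (learning_rate - 0.01) ** 2 + ((hidden_units - 128) ** 2) / 1e4

# In a meta-learning setting these seeds would be retrieved from runs on
# similar datasets; here they are hard-coded placeholders.
warm_start_configs = [[0.01, 128], [0.05, 64]]
warm_start_losses = [pipeline_validation_loss(c) for c in warm_start_configs]

result = gp_minimize(
    pipeline_validation_loss,
    search_space,
    x0=warm_start_configs,  # seed the surrogate with prior evaluations
    y0=warm_start_losses,
    n_calls=20,
    random_state=0,
)
print("best config:", result.x, "loss:", result.fun)
```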
Publisher
Springer Science and Business Media LLC