Optimizing Data Pipelines for Machine Learning in Feature Stores

Published: 2023-09 Issue: 13 Volume: 16 Page: 4230-4239
ISSN: 2150-8097
Container-title: Proceedings of the VLDB Endowment
Language: en
Short-container-title: Proc. VLDB Endow.

Author:

Liu Rui¹,Park Kwanghyun²,Psallidas Fotis³,Zhu Xiaoyong³,Mo Jinghui⁴,Sen Rathijit³,Interlandi Matteo³,Karanasos Konstantinos⁵,Tian Yuanyuan³,Camacho-Rodríguez Jesús³

Affiliation:

1. University of Chicago

2. Yonsei University

3. Microsoft

4. LinkedIn

5. Meta

Abstract

Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management are time-consuming. Feature stores have recently emerged as a new "DBMS-for-ML" with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry, leaving ample opportunities for database-style optimizations to enhance their performance. In this paper, we propose a novel set of optimizations specifically targeted at point-in-time join, a critical operation in data pipelines. We implement these optimizations on top of Feathr, a widely used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3× over state-of-the-art baselines.
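
For readers unfamiliar with the point-in-time join named in the abstract, the following is a minimal sketch of its semantics only: each observation row is matched with the most recent feature value recorded at or before the observation's timestamp, so that no future data leaks into training. It is not the paper's optimized algorithm and does not use Feathr's API; the function name `point_in_time_join`, the tuple layouts, and the sample data are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(observations, features):
    """Attach to each (key, ts) observation the latest feature value
    whose timestamp is <= ts, or None if no such value exists.

    observations: iterable of (key, ts)          -- hypothetical layout
    features:     iterable of (key, ts, value)   -- hypothetical layout
    """
    # Group feature rows by entity key, ordered by timestamp.
    by_key = defaultdict(list)
    for key, ts, value in sorted(features, key=lambda r: (r[0], r[1])):
        by_key[key].append((ts, value))

    joined = []
    for key, ts in observations:
        rows = by_key.get(key, [])
        timestamps = [t for t, _ in rows]
        # Position of the first feature row strictly after the observation.
        idx = bisect_right(timestamps, ts)
        value = rows[idx - 1][1] if idx > 0 else None
        joined.append((key, ts, value))
    return joined


# Hypothetical sample data: (customer_id, day, feature_value)
features = [("c1", 1, 10.0), ("c1", 5, 12.5), ("c2", 3, 7.0)]
observations = [("c1", 4), ("c1", 5), ("c2", 2)]
result = point_in_time_join(observations, features)
```

In this sample, the observation for `c1` on day 4 picks up the value from day 1 rather than day 5, and the observation for `c2` on day 2 gets `None` because its only feature value arrives later.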

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3625054.3625060

References: 61 articles.

1. 2019. Delta Lake. https://delta.io/. Accessed: 2023-02-23.

2. 2022. Amazon Redshift - Automated materialized views. https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-auto-mv.html. Accessed: 2022-10-02.

3. 2022. Apache Spark. https://spark.apache.org/. Accessed: 2022-10-02.

4. 2022. Apache Spark in Azure Synapse Analytics. https://learn.microsoft.com/azure/synapse-analytics/spark/apache-spark-overview. Accessed: 2022-10-02.

5. 2022. Azure Blob Storage. https://azure.microsoft.com/en-us/products/storage/blobs. Accessed: 2022-10-02.

Cited by 1 article.

1. The Hopsworks Feature Store for Machine Learning. Companion of the 2024 International Conference on Management of Data. 2024-06-09.