Data Lifecycle Challenges in Production Machine Learning

Published: 2018-12-11 Issue: 2 Volume: 47 Page: 17-28
ISSN: 0163-5808
Container-title: ACM SIGMOD Record
Language: en
Short-container-title: SIGMOD Rec.

Author:

Polyzotis Neoklis¹, Roy Sudip¹, Whang Steven Euijong², Zinkevich Martin¹

Affiliation:

1. Google Research, Mountain View, CA, USA

2. KAIST, Daejeon, South Korea

Abstract

Machine learning has become an essential tool for gleaning knowledge from data and tackling a diverse set of computationally hard tasks. However, the accuracy of a machine-learned model is deeply tied to the data that it is trained on. Designing and building robust processes and tools that make it easier to analyze, validate, and transform the data fed into large-scale machine learning systems poses data management challenges. Drawing on our experience in developing data-centric infrastructure for a production machine learning platform at Google, we summarize some of the interesting research challenges that we encountered, and survey some of the relevant literature from the data management and machine learning communities. Specifically, we explore challenges in three main areas of focus: data understanding, data validation and cleaning, and data preparation. In each of these areas, we explore how the constraints imposed on solutions differ depending on where in the lifecycle of a model the problems are encountered and who encounters them.

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3299887.3299891

Reference: 74 articles.

1. Deep learning for detection of diabetic eye disease. https://research.googleblog.com/2016/11/deep-learning-for-detection-of-diabetic.html.

2. Kaggle. https://www.kaggle.com/.

3. Keras. https://keras.io/.

4. Mxnet. https://mxnet.incubator.apache.org/.

5. Tensorflow. https://www.tensorflow.org/.

Cited by 128 articles.

1. Adaptive data quality scoring operations framework using drift-aware mechanism for industrial applications;Journal of Systems and Software;2024-11

2. Quality issues in machine learning software systems;Empirical Software Engineering;2024-09-11

3. AI generated route data pre-processing for faster decision making;2024 8th International Young Engineers Forum on Electrical and Computer Engineering (YEF-ECE);2024-07-05

4. A Data-Driven Method for Water Quality Analysis and Prediction for Localized Irrigation;AgriEngineering;2024-06-18

5. An empirical study of challenges in machine learning asset management;Empirical Software Engineering;2024-06-15