Workflow analysis of data science code in public GitHub repositories

Published:2022-11-19 Issue:1 Volume:28
ISSN:1382-3256
Container-title:Empirical Software Engineering
language:en
Short-container-title:Empir Software Eng

Author:

Ramasamy Dhivyabharathi,Sarasua Cristina,Bacchelli Alberto,Bernstein Abraham

Abstract

Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using the first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigate the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted data set and (b) the data set labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.
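The transition analysis mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the step names, the sample sequences, and the function name `transition_probabilities` are hypothetical, and the sketch assumes exactly one step label per code cell, whereas the abstract notes that a cell may carry more than one step. It estimates first-order Markov transition probabilities by counting consecutive label pairs and normalising each row.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities.

    `sequences` is a list of label sequences, one per notebook, where each
    label is the (hypothetical) data science step assigned to a code cell.
    Returns a nested dict: probs[current_step][next_step] -> probability.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each pair of consecutive cells (current step, next step).
        for current_step, next_step in zip(seq, seq[1:]):
            counts[current_step][next_step] += 1

    probs = {}
    for current_step, next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalise the row so outgoing probabilities sum to 1.
        probs[current_step] = {s: c / total for s, c in next_counts.items()}
    return probs


# Made-up example sequences; the labels are illustrative only.
notebooks = [
    ["load_data", "preprocessing", "modelling",
     "preprocessing", "modelling", "evaluation"],
    ["load_data", "preprocessing", "preprocessing",
     "modelling", "evaluation"],
]

probs = transition_probabilities(notebooks)
for step, row in probs.items():
    print(step, row)
```

In this toy input, a nonzero probability for the transition from modelling back to preprocessing is the kind of backward move that would indicate an iterative workflow.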

Funder

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung

University of Zurich

Publisher

Springer Science and Business Media LLC

Subject

Software

Link

https://link.springer.com/content/pdf/10.1007/s10664-022-10229-z.pdf

Cited by 4 articles.

1. Static analysis driven enhancements for comprehension in machine learning notebooks;Empirical Software Engineering;2024-08-12

2. Evaluating Navigation and Comparison Performance of Computational Notebooks on Desktop and in Virtual Reality;Proceedings of the CHI Conference on Human Factors in Computing Systems;2024-05-11

3. A Large-Scale Study of ML-Related Python Projects;Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing;2024-04-08

4. Visualising data science workflows to support third-party notebook comprehension: an empirical study;Empirical Software Engineering;2023-03-23