Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement

Published: 2024-04-18 Issue: 4 Volume: 33 Page: 1-27
ISSN: 1049-331X
Container-title: ACM Transactions on Software Engineering and Methodology
Language: en
Short-container-title: ACM Trans. Softw. Eng. Methodol.

Author:

Gao Kai¹, He Runzhi², Xie Bing², Zhou Minghui²

Affiliation:

1. School of Software & Microelectronics, Peking University, Beijing, China and Key Laboratory of High Confidence Software Technologies, Ministry of Education, China

2. School of Computer Science, Peking University, Beijing, China and Key Laboratory of High Confidence Software Technologies, Ministry of Education, China

Abstract

Deep learning (DL) frameworks have become the cornerstone of the rapidly developing DL field. Through installation dependencies specified in the distribution metadata, numerous packages directly or transitively depend on DL frameworks, layer after layer, forming DL package supply chains (SCs), which are critical for DL frameworks to remain competitive. However, vital knowledge on how to nurture and sustain DL package SCs is still lacking. Acquiring this knowledge may help DL frameworks formulate effective measures to strengthen their SCs and remain competitive, and may shed light on dependency issues and practices in the DL SC for researchers and practitioners. In this paper, we explore the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs to bridge this knowledge gap. We analyze the metadata of nearly six million PyPI package distributions and construct version-sensitive SCs for two popular DL frameworks: TensorFlow and PyTorch. We find that popular packages (measured by the number of monthly downloads) in the two SCs cover 34 domains belonging to eight categories. The Applications, Infrastructure, and Sciences categories account for over 85% of popular packages in either SC, and the TensorFlow and PyTorch SCs have developed specializations in Infrastructure and Applications packages, respectively. We employ the Leiden community detection algorithm and detect 131 and 100 clusters in the two SCs. The clusters mainly exhibit four shapes: Arrow, Star, Tree, and Forest, in order of increasing dependency complexity. Most clusters are Arrow or Star, while Tree and Forest clusters account for most packages (TensorFlow SC: 70.7%, PyTorch SC: 92.9%). We identify three groups of reasons why packages disengage from the SC (i.e., remove the DL framework and its dependents from their installation dependencies): dependency issues, functional improvements, and ease of installation. The most common reason in the TensorFlow SC is dependency incompatibility, and in the PyTorch SC it is to simplify functionalities and reduce installation size. Our study provides rich implications for DL framework vendors, researchers, and practitioners on the maintenance and dependency management practices of PyPI DL SCs.
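
To make the notions of a layered supply chain and of disengagement concrete, the following is a minimal Python sketch, not the authors' pipeline. It assumes a toy, invented metadata table (the package names and the `RELEASES` structure are hypothetical), ignores version specifiers, and treats a package as part of the SC if any of its releases depends on the framework or on another SC member. A package counts as disengaged when its latest release no longer depends on the framework or any SC member.

```python
from collections import deque

# Hypothetical toy data: package -> releases in chronological order,
# each release being (version, direct install dependencies).
# Real metadata also carries version specifiers, which are omitted here.
RELEASES = {
    "dl-framework": [("1.0", []), ("2.0", [])],
    "vision-lib": [("0.1", ["dl-framework"]), ("0.2", ["dl-framework", "array-lib"])],
    "demo-app": [("1.0", ["vision-lib"]), ("2.0", ["array-lib"])],
    "array-lib": [("1.0", [])],
}


def supply_chain(releases, framework):
    """Map every package that ever depended on `framework`, directly or
    transitively, to its layer (shortest dependency distance)."""
    # Build reverse edges: dependency -> packages that declared it in any release.
    reverse = {}
    for pkg, history in releases.items():
        for _version, deps in history:
            for dep in deps:
                reverse.setdefault(dep, set()).add(pkg)

    # Breadth-first search outward from the framework, layer by layer.
    layers = {framework: 0}
    queue = deque([framework])
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in layers:
                layers[dependent] = layers[current] + 1
                queue.append(dependent)

    del layers[framework]  # report only downstream packages
    return layers


def disengaged(releases, framework):
    """Return SC packages whose latest release depends on neither the
    framework nor any other SC member."""
    members = set(supply_chain(releases, framework)) | {framework}
    result = []
    for pkg in members - {framework}:
        _version, latest_deps = releases[pkg][-1]
        if not members.intersection(latest_deps):
            result.append(pkg)
    return sorted(result)


if __name__ == "__main__":
    print(supply_chain(RELEASES, "dl-framework"))
    print(disengaged(RELEASES, "dl-framework"))
```

In this toy data, `vision-lib` sits in layer 1 and `demo-app` in layer 2, and `demo-app` is reported as disengaged because its latest release depends only on `array-lib`, which is outside the SC.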

Funder

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3640336

Cited by 2 articles.

1. PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages;Proceedings of the ACM on Software Engineering;2024-07-12

2. Sustainability Forecasting for Deep Learning Packages;2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER);2024-03-12