Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement

Author:

Gao Kai (1), He Runzhi (2), Xie Bing (2), Zhou Minghui (2)

Affiliation:

1. School of Software & Microelectronics, Peking University, Beijing, China and Key Laboratory of High Confidence Software Technologies, Ministry of Education, China

2. School of Computer Science, Peking University, Beijing, China and Key Laboratory of High Confidence Software Technologies, Ministry of Education, China

Abstract

Deep learning (DL) frameworks have become the cornerstone of the rapidly developing DL field. Through installation dependencies specified in the distribution metadata, numerous packages directly or transitively depend on DL frameworks, layer after layer, forming DL package supply chains (SCs), which are critical for DL frameworks to remain competitive. However, vital knowledge on how to nurture and sustain DL package SCs is still lacking. Acquiring this knowledge may help DL frameworks formulate effective measures to strengthen their SCs and remain competitive, and may shed light on dependency issues and practices in the DL SC for researchers and practitioners. In this paper, we explore the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs to bridge this knowledge gap. We analyze the metadata of nearly six million PyPI package distributions and construct version-sensitive SCs for two popular DL frameworks: TensorFlow and PyTorch. We find that popular packages (measured by the number of monthly downloads) in the two SCs cover 34 domains belonging to eight categories. The Applications, Infrastructure, and Sciences categories account for over 85% of popular packages in either SC, and the TensorFlow and PyTorch SCs have developed specializations in Infrastructure and Applications packages, respectively. We employ the Leiden community detection algorithm and detect 131 and 100 clusters in the two SCs. The clusters mainly exhibit four shapes, in order of increasing dependency complexity: Arrow, Star, Tree, and Forest. Most clusters are Arrow or Star, while Tree and Forest clusters account for most packages (TensorFlow SC: 70.7%, PyTorch SC: 92.9%). We identify three groups of reasons why packages disengage from the SC (i.e., remove the DL framework and its dependents from their installation dependencies): dependency issues, functional improvements, and ease of installation. The most common reason in the TensorFlow SC is dependency incompatibility; in the PyTorch SC, it is to simplify functionalities and reduce installation size. Our study provides rich implications for DL framework vendors, researchers, and practitioners on the maintenance and dependency management practices of PyPI DL SCs.
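
The notions of a layered supply chain and of disengagement described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the package names, the `TOY_DEPENDENCIES` mapping, and the helper functions `supply_chain_layers` and `disengaged` are all hypothetical, and the sketch ignores version constraints, which the paper's version-sensitive construction does handle.

```python
from collections import defaultdict, deque

# Hypothetical toy metadata: package -> set of direct installation dependencies.
# Real PyPI metadata also carries version specifiers, which are ignored here.
TOY_DEPENDENCIES = {
    "torch": set(),
    "numpy": set(),
    "torchvision": {"torch"},
    "lightning-demo": {"torch", "numpy"},
    "vision-app": {"torchvision"},
    "unrelated": {"numpy"},
}


def supply_chain_layers(dependencies, framework):
    """Map each direct or transitive dependent of `framework` to its layer.

    Layer 1 holds direct dependents, layer 2 holds their dependents, and so on
    (shortest dependency distance, found by breadth-first search).
    """
    # Invert the edges: dependency -> packages that require it.
    dependents = defaultdict(set)
    for package, requires in dependencies.items():
        for required in requires:
            dependents[required].add(package)

    layers = {}
    seen = {framework}
    queue = deque([(framework, 0)])
    while queue:
        current, depth = queue.popleft()
        for dependent in sorted(dependents.get(current, ())):
            if dependent not in seen:
                seen.add(dependent)
                layers[dependent] = depth + 1
                queue.append((dependent, depth + 1))
    return layers


def disengaged(old_dependencies, new_dependencies, framework):
    """Packages that were in the supply chain before, still exist, and have left it."""
    old_members = set(supply_chain_layers(old_dependencies, framework))
    new_members = set(supply_chain_layers(new_dependencies, framework))
    return (old_members - new_members) & set(new_dependencies)


layers = supply_chain_layers(TOY_DEPENDENCIES, "torch")

# A later snapshot in which the hypothetical "vision-app" drops torchvision.
LATER_DEPENDENCIES = dict(TOY_DEPENDENCIES)
LATER_DEPENDENCIES["vision-app"] = {"numpy"}
left_chain = disengaged(TOY_DEPENDENCIES, LATER_DEPENDENCIES, "torch")
```

In this toy data, `layers` places the direct dependents of `torch` in layer 1 and `vision-app` in layer 2, and `left_chain` contains only `vision-app`, because its later dependency set no longer reaches the framework directly or transitively.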

Funder

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)


