Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning

Published: 2022-02-03 Issue: 2 Volume: 13 Page: 73
ISSN: 2078-2489
Container-title: Information
Language: en
Short-container-title: Information

Author:

Eyal Salman Hamzeh, Alshara Zakarea, Seriai Abdelhak-Djamel

Abstract

Context: In a social coding platform such as GitHub, contributors frequently use the pull-request mechanism to submit their code changes to the reviewers of a given repository. In general, these code changes either add a new feature or fix an existing bug. However, this mechanism is distributed, so different contributors may unintentionally submit similar pull-requests that perform similar development activities. Similar pull-requests may then be reviewed in parallel by different reviewers, which wastes reviewing time and effort and complicates the collaboration process.

Objective: It is therefore useful to assign similar pull-requests to the same reviewer, who can then decide which pull-request to accept with less time and effort. In this article, we propose to group similar pull-requests into clusters so that each cluster is assigned to the same reviewer or the same reviewing team, thereby saving reviewing time and effort.

Method: To do so, we first extract descriptive textual information from the content of pull-requests. Then, we use the extracted information to compute similarities among pull-requests. Finally, machine learning algorithms (K-Means clustering and agglomerative hierarchical clustering) are used to group similar pull-requests together.

Results: To validate our proposal, we applied it to twenty popular repositories from a public dataset. The experimental results show that the proposed approach achieves promising results according to the metrics commonly used in this area: precision and recall. Furthermore, it helps to save reviewer time and effort.

Conclusion: According to the obtained results, the K-Means algorithm achieves average precision and recall of 94% and 91%, respectively, over all considered repositories, while agglomerative hierarchical clustering achieves 93% and 98%, respectively. Moreover, the proposed approach saves between 67% and 91% of reviewing time and effort on average with the K-Means algorithm, and between 67% and 83% with the agglomerative hierarchical clustering algorithm.
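
The three steps named in the Method (extract text, measure similarity, cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which text features, similarity measure, or stopping rule the paper uses, so the TF-IDF weighting, cosine similarity, average linkage, and the `threshold` value below are all assumptions chosen for illustration. Only the agglomerative variant is shown; the function names and the sample pull-request titles are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (assumed weighting, not taken from the paper).
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative_clusters(docs, threshold=0.2):
    """Average-linkage agglomerative clustering over pull-request texts.

    Starts with one cluster per document and repeatedly merges the two most
    similar clusters until the best similarity drops below `threshold`
    (a hypothetical cut-off, not a value from the paper).
    """
    vecs = tfidf_vectors(docs)
    clusters = [[i] for i in range(len(docs))]

    def avg_sim(c1, c2):
        total = sum(cosine(vecs[i], vecs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        sim, i, j = max(
            (avg_sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if sim < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical pull-request titles, invented for this example.
example_prs = [
    "Fix null pointer crash in login form validation",
    "Fix crash in login form validation when password is null",
    "Add dark mode theme toggle to settings page",
    "Add dark mode toggle in settings page",
]
example_clusters = agglomerative_clusters(example_prs)
print(example_clusters)  # [[0, 1], [2, 3]]
```

Each resulting cluster holds the indices of pull-requests that would be routed to one reviewer or reviewing team. A K-Means variant would instead need the number of clusters fixed in advance.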

Publisher

MDPI AG

Subject

Information Systems

Link

https://www.mdpi.com/2078-2489/13/2/73/pdf

References: 53 articles.

1. Redundancy, Context, and Preference: An Empirical Study of Duplicate Pull Requests in OSS Projects

2. An insight into the pull requests of GitHub

3. Feature-Level Change Impact Analysis Using Formal Concept Analysis

4. Duplicate Pull Request Detection

Cited by 5 articles.

1. PR-DupliChecker: detecting duplicate pull requests in Fork-based workflows;International Journal of System Assurance Engineering and Management;2024-06-19

2. AI-based clustering of similar issues in GitHub’s repositories;Journal of Computer Languages;2024-03

3. Leveraging a combination of machine learning and formal concept analysis to locate the implementation of features in software variants;Information and Software Technology;2023-12

4. Extracting Insights from Big Source Code Repositories with Automatic Clustering of Projects by File Names and Types;2023 International Conference on Smart Applications, Communications and Networking (SmartNets);2023-07-25

5. Analysis of RSS Patterns to Detect Rogue Access Points;2022 International Conference on Emerging Trends in Computing and Engineering Applications (ETCEA);2022-11