Duplicate Detection in Programming Question Answering Communities

Authors:

Zhang Wei Emma (1), Sheng Quan Z. (1), Lau Jey Han (2), Abebe Ermyas (3), Ruan Wenjie (4)

Affiliation:

1. Macquarie University, Australia

2. The University of Melbourne and IBM Research Australia, Australia

3. IBM Research Australia, Australia

4. University of Oxford, Oxford, UK

Abstract

Community-based Question Answering (CQA) websites have attracted increasing numbers of users and contributors in recent years. However, duplicate questions occur frequently on CQA websites and are currently identified manually by moderators. Automatic duplicate detection, on the one hand, relieves moderators of this laborious effort before they close a question and, on the other hand, helps question issuers find answers quickly. A number of studies have looked into related problems, but very few target duplicate detection in Programming CQA (PCQA), a branch of CQA dedicated to programmers. Existing works frame the task as a supervised learning problem on question pairs and rely only on textual features. Moreover, the issue of selecting candidate duplicates from large volumes of historical questions is often left unaddressed. To tackle these issues, we model duplicate detection as a two-stage "ranking-classification" problem over question pairs. In the first stage, we rank the historical questions according to their similarity to the newly issued question and select the top-ranked ones as candidates, which reduces the search space. In the second stage, we develop novel features that capture both textual similarity and latent semantics on question pairs, leveraging techniques from the deep learning and information retrieval literature. Experiments on real-world questions about multiple programming languages demonstrate that our method works very well, in some cases improving on state-of-the-art benchmarks by up to 25%.
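
The two-stage "ranking-classification" pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the ranking function, the features, or the classifier, so the sketch assumes term-frequency cosine similarity for ranking, two simple textual features (cosine and Jaccard overlap) with no latent-semantic features, and a fixed threshold in place of a trained classifier. All function names and parameters (`rank_candidates`, `pair_features`, `is_duplicate`, `detect_duplicates`, `top_k`, `threshold`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(new_question, history, top_k=2):
    """Stage 1 (assumed ranking): keep the top_k most similar past questions."""
    query = Counter(tokenize(new_question))
    scored = [(cosine(query, Counter(tokenize(h))), h) for h in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [h for _, h in scored[:top_k]]


def pair_features(q1, q2):
    """Stage 2 features (assumed): two purely textual similarity scores."""
    t1, t2 = tokenize(q1), tokenize(q2)
    s1, s2 = set(t1), set(t2)
    union = s1 | s2
    jaccard = len(s1 & s2) / len(union) if union else 0.0
    return {"cosine": cosine(Counter(t1), Counter(t2)), "jaccard": jaccard}


def is_duplicate(features, threshold=0.5):
    """Stand-in for a trained classifier: threshold on the mean feature value."""
    return sum(features.values()) / len(features) >= threshold


def detect_duplicates(new_question, history, top_k=2, threshold=0.5):
    """Run both stages: rank to shrink the search space, then classify pairs."""
    candidates = rank_candidates(new_question, history, top_k)
    return [
        c for c in candidates
        if is_duplicate(pair_features(new_question, c), threshold)
    ]
```

Calling `detect_duplicates(new_question, history)` returns the historical questions that survive both stages. Ranking first means the pairwise features are computed only for `top_k` candidates rather than for every historical question, which is the search-space reduction the abstract refers to.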

Funder

Australian Research Council (ARC) Future Fellowship

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications

Cited by 19 articles.

1. Refining GPT-3 Embeddings with a Siamese Structure for Technical Post Duplicate Detection;2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER);2024-03-12

2. Looking for related posts on GitHub discussions;PeerJ Computer Science;2023-11-09

3. Duplicate Question Retrieval and Confirmation Time Prediction in Software Communities;Proceedings of the International Conference on Advances in Social Networks Analysis and Mining;2023-11-06

4. Analyzing Techniques for Duplicate Question Detection on Q&A Websites for Game Developers;Empirical Software Engineering;2022-12-08

5. Exploring the Feasibility of Transformer Based Models on Question Relatedness;2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys);2022-12
