Affiliation:
1. Macquarie University, Australia
2. The University of Melbourne and IBM Research Australia, Australia
3. IBM Research Australia, Australia
4. University of Oxford, Oxford, UK
Abstract
Community-based Question Answering (CQA) websites have attracted increasing numbers of users and contributors in recent years. However, duplicate questions occur frequently on CQA websites and are currently identified manually by moderators. Automatic duplicate detection, on the one hand, reduces the effort moderators spend before they close a question and, on the other hand, helps question issuers find answers quickly. A number of studies have looked into related problems, but very few target duplicate detection in Programming CQA (PCQA), a branch of CQA dedicated to programmers. Existing works frame the task as a supervised learning problem over question pairs and rely only on textual features. Moreover, the issue of selecting candidate duplicates from large volumes of historical questions is often unaddressed. To tackle these issues, we model duplicate detection as a two-stage “ranking-classification” problem over question pairs. In the first stage, we rank the historical questions according to their similarity to the newly issued question and select the top-ranked ones as candidates to reduce the search space. In the second stage, we develop novel features that capture both textual similarity and latent semantics of question pairs, leveraging techniques from the deep learning and information retrieval literature. Experiments on real-world questions about multiple programming languages demonstrate that our method works very well, in some cases achieving up to 25% improvement over state-of-the-art benchmarks.
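The two-stage idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the bag-of-words cosine ranking, and the Jaccard threshold that stands in for the trained classifier and its learned features are all assumptions chosen only to show the rank-then-classify flow.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(new_question, history, top_k=2):
    """Stage 1: keep the top_k historical questions most similar to the new one."""
    query = Counter(tokenize(new_question))
    scored = [(cosine(query, Counter(tokenize(h))), h) for h in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [h for _, h in scored[:top_k]]


def is_duplicate(new_question, candidate, threshold=0.5):
    """Stage 2 stand-in: a Jaccard-overlap threshold replaces a trained classifier."""
    a, b = set(tokenize(new_question)), set(tokenize(candidate))
    union = a | b
    return bool(union) and len(a & b) / len(union) >= threshold


def detect_duplicates(new_question, history, top_k=2, threshold=0.5):
    """Rank first to shrink the search space, then classify each candidate pair."""
    candidates = rank_candidates(new_question, history, top_k)
    return [c for c in candidates if is_duplicate(new_question, c, threshold)]
```

In this sketch, only the `top_k` candidates from the ranking stage are passed to the pairwise check, which is the point of the first stage: the more expensive second stage never sees the full question history.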
Funder
Australian Research Council (ARC) Future Fellowship
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications
Cited by
19 articles.
1. Refining GPT-3 Embeddings with a Siamese Structure for Technical Post Duplicate Detection;2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER);2024-03-12
2. Looking for related posts on GitHub discussions;PeerJ Computer Science;2023-11-09
3. Duplicate Question Retrieval and Confirmation Time Prediction in Software Communities;Proceedings of the International Conference on Advances in Social Networks Analysis and Mining;2023-11-06
4. Analyzing Techniques for Duplicate Question Detection on Q&A Websites for Game Developers;Empirical Software Engineering;2022-12-08
5. Exploring the Feasibility of Transformer Based Models on Question Relatedness;2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys);2022-12