Author:
Strukova Sofia,Ruipérez-Valiente José A.,Gómez Mármol Félix
Abstract
AbstractThe irreplaceable key to the triumph of Question & Answer (Q & A) platforms is their users providing high-quality answers to the challenging questions posted across various topics of interest. From more than a decade, the expert finding problem attracted much attention in information retrieval research. Based on the encountered gaps in the expert identification across several Q & A portals, we inspect the feasibility of identifying data science experts in Reddit. Our method is based on the manual coding results where two data science experts labelled not only expert and non-expert comments, but also out-of-scope comments, which is a novel contribution to the literature, enabling the identification of more groups of comments across web portals. We present a semi-supervised approach which combines 1113 labelled comments with 100,226 unlabelled comments during training. We proved that it is possible to develop models that can identify expert, non-expert and out-of-scope comments peaking the AUC score at 0.93, accuracy at 0.83, MAE at 0.15 degrees and R2 score at 0.69. The proposed model uses the activity behaviour of every user, including Natural Language Processing (NLP), crowdsourced and user feature sets. We conclude that the NLP and user feature sets contribute the most to the better identification of these three classes. It means that this method can generalise well within the domain. Finally, we make a novel contribution by presenting different types of users in Reddit, which opens many future research directions.
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications,Software
Reference54 articles.
1. Razeeth, M., Kariapper, R., Pirapuraj, P., Nafrees, A., Rishan, U., Nusrath Ali, S.: E-learning at home vs traditional learning among higher education students: a survey based analysis (2019)
2. Strukova, S., Ruipérez-Valiente, J.A., Mármol, F.G.: A survey on data-driven evaluation of competencies and capabilities across multimedia environments. Int. J. Interact. Multi. Artif. Intell. (2022). https://doi.org/10.9781/ijimai.2022.10.004
3. Aabdelaziz: Best & Most Popular Forums, Message Boards & Online Communities. https://it-maniacs.com/best-and-most-popular-forums-message-boards-and-online-communities-top-30/. Accessed 10 Feb 2022 (2021)
4. Ansari, N., Sharma, R.: Identifying semantically duplicate questions using data science approach: A quora case study. arXiv preprint arXiv:2004.11694 (2020). https://doi.org/10.48550/arXiv.2004.11694
5. Rogers, A., Gardner, M., Augenstein, I.: Qa dataset explosion: a taxonomy of nlp resources for question answering and reading comprehension. ACM Comput. Surv. (2023). https://doi.org/10.1145/3560260
Cited by
1 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献