Affiliations:
1. Punjab University College of Information Technology, Lahore, Pakistan
2. Department of Data Science, University of the Punjab, Lahore
3. Department of Computer Science, Islamia University Bahawalpur, Bahawalpur
4. Department of Computer Science, University of New South Wales, Canberra
Abstract
Authorship attribution refers to examining the writing style of authors to determine the most likely author of a document from a given set of candidate authors. Due to the wide range of authorship attribution applications, a plethora of studies have been conducted for various Western, as well as Asian, languages. However, authorship attribution research in the Urdu language has only just begun, although Urdu is widely acknowledged as a prominent South Asian language. Furthermore, the existing studies on authorship attribution in Urdu have addressed the considerably easier problem of having fewer than 20 candidate authors, which is far from real-world settings. Therefore, the findings from these studies may not be applicable to real-world settings. To that end, we have made three key contributions. First, we have developed a large authorship attribution corpus for Urdu, which is a low-resource language. The corpus is composed of over 2.6 million tokens and 21,938 news articles by 94 authors, which makes it a closer substitute for real-world settings. Second, we have analyzed hundreds of stylometry features used in the literature to identify 194 features that are applicable to the Urdu language, and we have developed a taxonomy of these features. Finally, we have performed 66 experiments using two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show the following: (a) our developed corpus is many times larger than the existing corpora, and it is more challenging than its counterparts for the authorship attribution task, and (b) the Convolutional Neural Network is the most effective technique, as it achieved a nearly perfect F1 score of 0.989 for an existing corpus and 0.910 for our newly developed corpus.
Publisher
Association for Computing Machinery (ACM)
Cited by
4 articles.
1. Demystification and Actualisation of Data Saturation in Qualitative Research Through Thematic Analysis;International Journal of Qualitative Methods;2024-01
2. Albanian Authorship Attribution Model;2023 12th Mediterranean Conference on Embedded Computing (MECO);2023-06-06
3. Poet Attribution of Urdu Ghazals using Deep Learning;2023 3rd International Conference on Artificial Intelligence (ICAI);2023-02-22
4. A3C: Albanian Authorship Attribution Corpus;Springer Proceedings in Business and Economics;2023