Authorship Attribution for a Resource Poor Language—Urdu

Authors:

Nazir Zulqarnain (1), Shahzad Khurram (2), Malik Muhammad Kamran (1), Anwar Waheed (3), Bajwa Imran Sarwar (3), Mehmood Khawar (4)

Affiliation:

1. Punjab University College of Information Technology, Lahore, Pakistan

2. Department of Data Science, University of the Punjab, Lahore

3. Department of Computer Science, Islamia University Bahawalpur, Bahawalpur

4. Department of Computer Science, University of New South Wales, Canberra

Abstract

Authorship attribution refers to examining the writing style of authors to determine the most likely author of a document from a given set of candidate authors. Due to the wide range of authorship attribution applications, a plethora of studies have been conducted for various Western, as well as Asian, languages. However, authorship attribution research in the Urdu language has just begun, although Urdu is widely acknowledged as a prominent South Asian language. Furthermore, the existing studies on authorship attribution in Urdu have addressed the considerably easier problem of having fewer than 20 candidate authors, which is far from real-world settings. Therefore, the findings from these studies may not be applicable to real-world settings. To that end, we have made three key contributions. First, we have developed a large authorship attribution corpus for Urdu, which is a low-resource language. The corpus is composed of over 2.6 million tokens and 21,938 news articles by 94 authors, which makes it a closer approximation of real-world settings. Second, we have analyzed hundreds of stylometry features used in the literature to identify 194 features that are applicable to the Urdu language and developed a taxonomy of these features. Finally, we have performed 66 experiments using two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show the following: (a) our developed corpus is many times larger than the existing corpora, and it is more challenging than its counterparts for the authorship attribution task, and (b) Convolutional Neural Networks are the most effective technique, achieving a nearly perfect F1 score of 0.989 on an existing corpus and 0.910 on our newly developed corpus.
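As a rough illustration of the ideas the abstract names, the sketch below computes three generic stylometric features for a document and an F1 score over predicted authors. It is not the paper's method: the three features are common examples chosen for illustration and are not drawn from the paper's list of 194, the whitespace tokenization is a simplifying assumption, and the macro averaging of F1 is also an assumption, since the abstract does not state which averaging was used. The function names `stylometric_features` and `macro_f1` are hypothetical.

```python
from collections import Counter


def stylometric_features(text):
    """Return a few generic stylometric features for one document.

    Assumption: tokens are split on whitespace. Real Urdu text usually
    needs a dedicated word segmenter.
    """
    tokens = text.split()
    if not tokens:
        return {"avg_word_len": 0.0, "type_token_ratio": 0.0, "hapax_ratio": 0.0}
    counts = Counter(tokens)
    return {
        # Mean number of characters per token.
        "avg_word_len": sum(len(t) for t in tokens) / len(tokens),
        # Vocabulary richness: distinct tokens divided by all tokens.
        "type_token_ratio": len(counts) / len(tokens),
        # Share of tokens that occur exactly once in the document.
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / len(tokens),
    }


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-author F1 scores (assumed averaging scheme)."""
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In a full pipeline, feature dictionaries like these would be turned into vectors and passed to a classifier, and the classifier's predicted authors would be scored against the true authors with a function such as `macro_f1`.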

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Cited by 4 articles.

1. Demystification and Actualisation of Data Saturation in Qualitative Research Through Thematic Analysis;International Journal of Qualitative Methods;2024-01

2. Albanian Authorship Attribution Model;2023 12th Mediterranean Conference on Embedded Computing (MECO);2023-06-06

3. Poet Attribution of Urdu Ghazals using Deep Learning;2023 3rd International Conference on Artificial Intelligence (ICAI);2023-02-22

4. A3C: Albanian Authorship Attribution Corpus;Springer Proceedings in Business and Economics;2023
