UrduAI: Writeprints for Urdu Authorship Identification

Published: 2022-03-31 Issue: 2 Volume: 21 Page: 1-18
ISSN: 2375-4699
Container-title: ACM Transactions on Asian and Low-Resource Language Information Processing
Language: en
Short-container-title: ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Sarwar Raheem¹, Hassan Saeed-Ul²

Affiliation:

1. Research Group in Computational Linguistics, Research Institute of Information and Language Processing, University of Wolverhampton, Wolverhampton, Midlands, United Kingdom

2. Department of Computer Science, Information Technology University, Lahore, Punjab, Pakistan

Abstract

The authorship identification task aims to identify the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Existing Urdu authorship identification solutions lose accuracy as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. Moreover, due to the unavailability of reliable POS taggers and sentence segmenters, all existing authorship identification studies on Urdu text are limited to word n-gram features. To overcome these limitations, we formulate a stylometric feature space that is not limited to word n-gram features. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on the nearest neighbors classifier to predict the original author of the anonymous text sample. To evaluate our solution, we create a significantly larger corpus than those used in existing studies and conduct several experimental studies. These show that our solution can overcome the limitations of existing studies, and we report an accuracy of 94.03%, which is higher than that of all previous authorship identification works.
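The pipeline outlined in the abstract (text sample to point set, then nearest-neighbor prediction) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed-size word chunking, the cosine distance, and the majority vote are all assumptions made for illustration, and only word n-gram counts are used here, whereas the paper's stylometric feature space is stated to go beyond them.

```python
from collections import Counter
import math


def word_ngrams(tokens, n=2):
    """Count word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def to_point_set(text, chunk_size=50, n=2):
    """Turn one text into a set of points: one n-gram count vector per chunk.

    Whitespace tokenisation and fixed-size chunks are assumptions.
    """
    tokens = text.split()
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return [word_ngrams(chunk, n) for chunk in chunks if len(chunk) >= n]


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (assumed metric)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def predict_author(anonymous_text, labelled_texts, chunk_size=50, n=2):
    """Predict an author by 1-nearest-neighbor voting over points.

    labelled_texts is a list of (author, text) pairs. Each point of the
    anonymous text votes for the author of its nearest training point.
    Returns None when no vote can be cast.
    """
    training_points = [
        (author, point)
        for author, text in labelled_texts
        for point in to_point_set(text, chunk_size, n)
    ]
    if not training_points:
        return None
    votes = Counter()
    for query in to_point_set(anonymous_text, chunk_size, n):
        nearest_author, _ = min(
            training_points, key=lambda pair: cosine_distance(query, pair[1])
        )
        votes[nearest_author] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy data with placeholder tokens (not real Urdu text).
labelled = [
    ("A", "alpha beta gamma alpha beta gamma alpha beta"),
    ("B", "delta epsilon zeta delta epsilon zeta delta epsilon"),
]
guess = predict_author("alpha beta gamma alpha", labelled, chunk_size=4, n=2)
```

Representing a text as several points rather than one vector lets each chunk vote independently, which is one plausible reading of why a point-set representation could help when few training samples per author are available.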

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3476467

References: 56 articles.

1. Arabic authorship attribution: An extensive study on Twitter posts;Altakrori Malik H.;ACM Trans. Asian Low-Resour. Lang. Inf. Process.,2019

2. An Empirical Study on Forensic Analysis of Urdu Text Using LDA-Based Authorship Attribution

3. Role of discourse information in Urdu sentiment classification: A rule-based method and machine-learning technique;Awais Muhammad;ACM Trans. Asian Low-Resour. Lang. Inf. Process.,2019

4. Nearest neighbor classification from multiple feature subsets

Cited by 8 articles.

1. Crossing Linguistic Barriers: Authorship Attribution in Sinhala Texts;ACM Transactions on Asian and Low-Resource Language Information Processing;2024-05-10

2. AGI-P: A Gender Identification Framework for Authorship Analysis Using Customized Fine-Tuning of Multilingual Language Model;IEEE Access;2024

3. Poet Attribution of Urdu Ghazals using Deep Learning;2023 3rd International Conference on Artificial Intelligence (ICAI);2023-02-22

4. Autoencoder-Based Feature Extraction for Identifying Hate Speech Spreaders in Social Media;IEEE Transactions on Computational Social Systems;2023

5. Translator attribution for Arabic using machine learning;Digital Scholarship in the Humanities;2022-10-13