Affiliation:
1. Scuola Normale Superiore, Pisa, Italy
2. Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy
Abstract
In this article, we investigate the effects on authorship identification tasks (including authorship verification, closed-set authorship attribution, and closed-set and open-set same-author verification) of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In “classic” authorship analysis, a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has occasionally been used before, but has never been studied systematically. We argue that it is advantageous, and that, in some cases (e.g., authorship verification), it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (one of which we make available here for the first time) show that feature vectors representing pairs of documents (which we here call Diff-Vectors, or DVs) bring about systematic improvements in the effectiveness of authorship identification tasks, especially when training data are scarce (as is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the first, we also provide two novel methods for solving the second and third that use a solver for the first as a building block.
The code to reproduce our experiments is open-source and available online.
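As an illustration of the pairwise representation described in the abstract, the following minimal Python sketch builds one vector per unordered pair of documents, where each feature value is the absolute difference of the two documents' relative frequencies and the label states whether the two documents share an author. The function names (`relative_frequencies`, `diff_vector`, `build_pairs`), the whitespace tokenization, and the toy data are assumptions made for illustration only; they are not taken from the authors' released code, and the features and learner used in the article may differ.

```python
from collections import Counter
from itertools import combinations


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary feature in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[f] / total if total else 0.0 for f in vocabulary]


def diff_vector(x1, x2):
    """Feature-wise absolute difference; symmetric in its two arguments."""
    return [abs(a - b) for a, b in zip(x1, x2)]


def build_pairs(doc_vectors, authors):
    """One training example per unordered pair of documents.

    Label 1 means "same author", label 0 means "different authors".
    """
    X, y = [], []
    for i, j in combinations(range(len(doc_vectors)), 2):
        X.append(diff_vector(doc_vectors[i], doc_vectors[j]))
        y.append(1 if authors[i] == authors[j] else 0)
    return X, y


# Hypothetical toy corpus: three tiny "documents" over a two-word vocabulary.
vocabulary = ["a", "b"]
documents = ["a b a b", "a a a b", "b b b b"]
authors = ["A", "A", "B"]

vectors = [relative_frequencies(d.split(), vocabulary) for d in documents]
X, y = build_pairs(vectors, authors)
```

With n labeled documents this construction yields n(n-1)/2 labeled pairs, which is one way to read the abstract's remark that the pairwise representation can supply more information to the training process than one vector per document. The resulting `X` and `y` can be passed to any binary classifier.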
Funder
European Commission
Italian Ministry of University and Research under the NextGenerationEU program
Publisher
Association for Computing Machinery (ACM)