Same or Different? Diff-Vectors for Authorship Analysis

Author:

Corbara Silvia (1), Moreo Alejandro (2), Sebastiani Fabrizio (2)

Affiliation:

1. Scuola Normale Superiore, Pisa, Italy

2. Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy

Abstract

In this article, we investigate the effects on authorship identification tasks (including authorship verification, closed-set authorship attribution, and closed-set and open-set same-author verification) of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In "classic" authorship analysis, a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has occasionally been used before, but has never been studied systematically. We argue that it is advantageous and that, in some cases (e.g., authorship verification), it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (one of which we make available here for the first time) show that feature vectors representing pairs of documents (which we call Diff-Vectors, or DVs) bring about systematic improvements in the effectiveness of authorship identification tasks, especially when training data are scarce (as is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the first, we also provide two novel methods for solving the second and third that use a solver for the first as a building block.
The code to reproduce our experiments is open-source and available online.
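
As a rough illustration of the representation described in the abstract, the following Python sketch builds diff-vectors from a handful of documents. It is not the authors' implementation: the whitespace tokenizer, the fixed word vocabulary, the toy texts, and the function names (relative_frequencies, diff_vector, build_pairs) are all assumptions made for this example. Each feature value is the absolute difference between the relative frequencies of a word in the two documents, and the label is 1 when both documents share an author.

```python
from collections import Counter
from itertools import combinations


def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty texts
    return [counts[word] / total for word in vocabulary]


def diff_vector(freqs_a, freqs_b):
    """Element-wise absolute difference; symmetric, so the pair is unordered."""
    return [abs(a - b) for a, b in zip(freqs_a, freqs_b)]


def build_pairs(documents, authors, vocabulary):
    """Turn n labelled documents into n*(n-1)/2 labelled diff-vectors."""
    freqs = [relative_frequencies(doc, vocabulary) for doc in documents]
    X, y = [], []
    for i, j in combinations(range(len(documents)), 2):
        X.append(diff_vector(freqs[i], freqs[j]))
        y.append(1 if authors[i] == authors[j] else 0)  # 1 = same author
    return X, y


# Toy data, invented for illustration only.
vocabulary = ["the", "of", "and", "but"]
documents = [
    "the cat and the dog",
    "the end of the road",
    "but of course and yet",
]
authors = ["A", "A", "B"]

X, y = build_pairs(documents, authors, vocabulary)
```

Because every unordered pair of training documents yields one example, n documents produce n(n-1)/2 diff-vectors, which is one way to read the abstract's remark that this representation can supply more information to the training process. The resulting X and y can be passed to any supervised binary classifier, since the representation is learner-independent.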

Funder

European Commission

Italian Ministry of University and Research under the NextGenerationEU program

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science


Cited by 2 articles.