Authorship Attribution on Short Texts in the Slovenian Language

Published: 2023-10-04 Issue: 19 Volume: 13 Page: 10965
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Authors:

Gabrovšek Gregor¹, Peer Peter¹, Emeršič Žiga¹, Batagelj Borut¹

Affiliation:

1. Faculty of Computer and Information Science, University of Ljubljana, SI-1000 Ljubljana, Slovenia

Abstract

The study investigates the task of authorship attribution on short texts in Slovenian using the BERT language model. Authorship attribution is the task of attributing a written text to its author, frequently using stylometry or computational techniques. We create five custom datasets for different numbers of included text authors and fine-tune two BERT models, SloBERTa and BERT Multilingual (mBERT), to evaluate their performance in closed-class and open-class problems with varying numbers of authors. Our models achieved an F1 score of approximately 0.95 when using the dataset with the comments of the top five users by the number of written comments. Training on datasets that include comments written by an increasing number of people results in models with a gradually decreasing F1 score. Including out-of-class comments in the evaluation decreases the F1 score by approximately 0.05. The study demonstrates the feasibility of using BERT models for authorship attribution in short texts in the Slovenian language.
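The dataset construction and evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the `OTHER` label for out-of-class comments, and the use of macro-averaged F1 are all assumptions made for illustration, and the fine-tuned BERT classifier itself is left out.

```python
from collections import Counter


def top_k_authors(comments, k):
    """Return the k authors with the most comments.

    `comments` is a list of (author, text) pairs (a hypothetical format).
    """
    counts = Counter(author for author, _ in comments)
    return [author for author, _ in counts.most_common(k)]


def build_datasets(comments, k, other_label="OTHER"):
    """Build a closed-class and an open-class view of the same comments.

    Closed-class: only comments by the top-k authors are kept.
    Open-class: all comments are kept, and authors outside the top k
    are relabeled with `other_label` (an assumed convention).
    """
    keep = set(top_k_authors(comments, k))
    closed = [(a, t) for a, t in comments if a in keep]
    opened = [(a if a in keep else other_label, t) for a, t in comments]
    return closed, opened


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores.

    The averaging scheme is an assumption; the abstract only says "F1 score".
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In this setup, a classifier trained on the closed-class view would be scored once on closed-class test comments and once on the open-class view, so that the two F1 values can be compared.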

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/19/10965/pdf

References (37 articles)

1. Stamatatos (2009). A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol.

2. Juola (2006). Authorship attribution. Found. Trends Inf. Retr.

3. Stamatatos (2011). Plagiarism and authorship analysis: Introduction to the special issue. Lang. Resour. Eval.

4. Theóphilo, A., Pereira, L.A., and Rocha, A. (2019, January 12–17). A needle in a haystack? Harnessing onomatopoeia and user-specific stylometrics for authorship attribution of micro-messages. Proceedings of the ICASSP IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.

5. Logar, N., Grčar, M., Brakus, M., Erjavec, T., Holdt, Š.A., and Krek, S. (2020). Corpora of the Slovenian Language Gigafida, Kres, ccGigafida and ccKRES: Construction, Content, Usage, Znanstvena Založba Filozofske Fakultete. (In Slovenian).