Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution

Published:2022-03-07 Issue:5 Volume:10 Page:838
ISSN:2227-7390
Container-title:Mathematics
language:en
Short-container-title:Mathematics

Author:

Škorić Mihailo, Stanković Ranka, Ikonić Nešić Milica, Byszuk Joanna, Eder Maciej

Abstract

This paper explores the effectiveness of parallel stylometric document embeddings in solving the authorship attribution task by testing a novel approach on literary texts in seven different languages, totaling 7051 unique 10,000-token chunks from 700 PoS- and lemma-annotated documents. We used these documents to produce four document embedding models using the Stylo R package (word-based, lemma-based, PoS-trigrams-based, and PoS-mask-based) and one document embedding model using mBERT for each of the seven languages. We created further derivations of these embeddings in the form of the average, product, minimum, maximum, and l2 norm of these document embedding matrices and tested them both including and excluding the mBERT-based document embeddings for each language. Finally, we trained several perceptrons on portions of the dataset in order to procure adequate weights for a weighted combination approach. We tested standalone (two baselines) and composite embeddings for classification accuracy, precision, recall, and weighted- and macro-averaged F1-score, and compared them with one another. We found that for each language most of our composition methods outperform the baselines (with a couple of methods outperforming all baselines for all languages), with or without mBERT inputs, which are found to have no significant positive impact on the results of our methods.
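
The composition step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each document embedding model is represented as a square document-by-document matrix of equal shape, that the compositions are applied element-wise, and that "l2 norm" means the element-wise Euclidean norm across models. The function name `combine` and the toy values are hypothetical, and the weights are hand-picked here, whereas the paper obtains them from trained perceptrons.

```python
import numpy as np

def combine(matrices, method="mean", weights=None):
    """Element-wise composition of equally shaped matrices (illustrative only)."""
    stack = np.stack(matrices)  # shape: (n_models, n_docs, n_docs)
    if method == "mean":
        return stack.mean(axis=0)
    if method == "product":
        return stack.prod(axis=0)
    if method == "min":
        return stack.min(axis=0)
    if method == "max":
        return stack.max(axis=0)
    if method == "l2norm":
        # Assumed interpretation: Euclidean norm taken across the models.
        return np.sqrt((stack ** 2).sum(axis=0))
    if method == "weighted":
        # One weight per model; contracts the model axis of the stack.
        w = np.asarray(weights, dtype=float)
        return np.tensordot(w, stack, axes=1)
    raise ValueError(f"unknown method: {method}")

# Toy matrices standing in for two embedding models over two documents.
word_based = np.array([[0.0, 0.2], [0.2, 0.0]])
lemma_based = np.array([[0.0, 0.6], [0.6, 0.0]])

mean_matrix = combine([word_based, lemma_based], "mean")
weighted_matrix = combine([word_based, lemma_based], "weighted", weights=[0.25, 0.75])
```

Stacking the matrices along a new leading axis keeps every composition a single reduction over that axis, so adding or removing a model (for example, the mBERT-based one) only changes the length of the input list.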

Funder

European Cooperation in Science and Technology

Publisher

MDPI AG

Subject

General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)

Link

https://www.mdpi.com/2227-7390/10/5/838/pdf

Reference: 43 articles.

1. Conjectures on World Literature;Moretti;New Left Rev.,2000

2. Authorship analysis studies: A survey;El;Int. J. Comput. Appl.,2014

3. Stylometry for Noisy Medieval Data: Evaluating Paul Meyer’s Hagiographic Hypothesis;Camps;arXiv,2020

4. Plagiarism and authorship analysis: introduction to the special issue

Cited by 6 articles.

1. Understanding writing style in social media with a supervised contrastively pre-trained transformer;Knowledge-Based Systems;2024-07

2. Significance of Single-Interval Discrete Attributes: Case Study on Two-Level Discretisation;Applied Sciences;2024-05-11

3. Importance of Characteristic Features and Their Form for Data Exploration;Entropy;2024-05-06

4. Authorship Attribution in Less-Resourced Languages: A Hybrid Transformer Approach for Romanian;Applied Sciences;2024-03-23

5. Transformer-Based Composite Language Models for Text Evaluation and Classification;Mathematics;2023-11-16