Affiliation:
1. School of Foreign Languages, South China University of Technology, Guangzhou 510641, China
2. School of Mathematics, South China University of Technology, Guangzhou 510641, China
Abstract
<abstract>
<p>Recent decades have witnessed the rapid development of literary studies on gender and writing style. A common limitation of previous studies, as some researchers have already pointed out, is that they analyze only a few texts. In this study, we attempt to find the features that best facilitate the classification of texts by authorial gender. Based on a corpus of 1113 classical fictions from the early 19<sup>th</sup> century to the early 20<sup>th</sup> century, eight algorithms (SVM, random forest, decision tree, AdaBoost, logistic regression, K-nearest neighbors, gradient boosting and XGBoost) are used to automatically select the features that are most useful for correctly categorizing a text. We find that word frequency is the most important predictor for identifying authorial gender in classical fictions, achieving an accuracy rate of 92%. We also find that nationhood has little impact on authorial gender differences in classical fictions, as genderlectal variation is 'universal' in the English-speaking world.</p>
</abstract>
Publisher
American Institute of Mathematical Sciences (AIMS)
Subject
Applied Mathematics, Computational Mathematics, General Agricultural and Biological Sciences, Modeling and Simulation, General Medicine