Affiliation:
1. Faculty of Engineering, Tokyo University of Science, 6-3-1 Niijuku, Katsushika-ku, Tokyo, Japan
2. Faculty of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe-shi, Kyoto-fu, Japan
Abstract
Word length is a feature extracted from texts to characterize authorial style; its use was quantitatively demonstrated by Mendenhall (Mendenhall, T. C., 1887, The characteristic curves of composition. Science, IX: 237–49). Many similar features for describing authorial style have been proposed; however, research indicates that word length identifies authors less accurately than other features. This study proposes a feature, referred to as c-wordL, that improves the accuracy of authorship attribution by classifying words into several types according to their part-of-speech (POS) tags and combining these types with word length data. The proposed method was tested on 200 literary texts from ten different authors in Japanese, English, and Chinese. The results indicated that c-wordL was more accurate than existing word length-based features and provided useful information that word unigrams and POS tag bigrams could not capture. In addition, the ease of interpretation of the different types of features is discussed. In summary, c-wordL outperformed strong existing features in explaining distinct writing styles and identifying authors.
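The idea of combining POS-based word types with word length can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the word types, the tag set, or how lengths are binned, so the function name `c_wordl_features`, the tag labels, and the `max_len` pooling threshold are all hypothetical choices made for illustration. The input is assumed to be already tokenized and POS-tagged.

```python
from collections import Counter


def c_wordl_features(tagged_tokens, max_len=10):
    """Relative frequency of (word type, word length) pairs.

    tagged_tokens: iterable of (word, pos_class) pairs, already tagged.
    max_len: lengths above this value are pooled into a single bin
             (an assumption; the abstract does not describe binning).
    """
    counts = Counter(
        (pos, min(len(word), max_len)) for word, pos in tagged_tokens
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize so texts of different sizes are comparable.
    return {key: n / total for key, n in counts.items()}


# Hypothetical tagged sample; the tag labels are illustrative only.
sample = [
    ("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("mat", "NOUN"),
]
features = c_wordl_features(sample)
```

Each text yields one such feature vector, which could then be passed to any standard classifier; the choice of classifier is outside what the abstract states.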
Publisher
Oxford University Press (OUP)
Subject
Computer Science Applications, Linguistics and Language, Language and Linguistics, Information Systems