Hapax remains: Regularity of low-frequency words in authorial texts

Author:

Faltýnek Dan (1), Matlach Vladimír (2)

Affiliation:

1. Department of General Linguistics, Palacky University in Olomouc, Olomouc, Czech Republic

2. Department of General Linguistics, Palacky University in Olomouc, Olomouc, Czech Republic

Abstract

This article highlights how the literature usually overlooks the regular occurrence of low-frequency words (hapax legomena) in specific authors’ texts. This oversight arises from the linguistic assumption that low-frequency words occur non-systematically and context-dependently in extensive texts, and from the tendency of SVM methods to mark low-frequency words as irrelevant compared to the more frequent lexicon (e.g. Boukhaled, M. A. and Ganascia, J.-G. (2015). Using function words for authorship attribution: bag-of-words vs. sequential rules. In The 11th International Workshop on Natural Language Processing and Cognitive Science, October 2014, Venice, Italy. de Gruyter, Natural Language Processing and Cognitive Science Proceedings 2014, pp. 115–122). Many approaches to authorship attribution are based on the n most frequent ‘function words’, which (1) are grammatically essential, frequent, and therefore included in each text; (2) are not affected by the topic of the text; and (3) reflect the unintentional linguistic activity of the author (Binongo, J. N. G. (2003). Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2): 9–17). Hapax legomena meet these conditions as well, except for frequency (Baayen, H., van Halteren, H., and Tweedie, F. (1996). Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3): 121–32). We test the hypothesis that hapax legomena can serve authorship attribution when only hapaxes are selected from whole texts (or only randomly selected tokens of hapaxes) and a specifically pre-processed input (the eigendecomposition of a cosine distance matrix) is passed to the SVM classifier. The method was evaluated on texts from fourteen Czech authors (yielding ninety-one author pairs in total) and on the data set of Evert et al. (Evert, S., Proisl, T., Jannidis, F. et al. (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities, 32(2): 4–16), and proved itself a suitable tool for identifying authors of previously unknown texts. Our method identifies a sparse network of regular occurrences of low-frequency words in different authors’ texts.
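The first steps named in the abstract (selecting only hapaxes from each text and building a cosine distance matrix over them) can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the binary presence/absence representation of hapaxes, the function names, and the toy texts are all assumptions made for illustration, and the eigendecomposition and SVM stages are left out. The last line checks the arithmetic behind the ninety-one author pairs.

```python
import math
import re
from collections import Counter


def hapax_legomena(text: str) -> set[str]:
    """Return the word types that occur exactly once in `text`."""
    # Assumption: a simple lowercase word tokenizer stands in for real preprocessing.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n == 1}


def cosine_distance(a: set[str], b: set[str]) -> float:
    """Cosine distance between binary (presence/absence) hapax vectors."""
    if not a or not b:
        return 1.0
    # For binary vectors the dot product is the size of the intersection
    # and each norm is the square root of the set size.
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


def distance_matrix(texts: list[str]) -> list[list[float]]:
    """Pairwise cosine distances between the hapax sets of all texts."""
    hapaxes = [hapax_legomena(t) for t in texts]
    return [[cosine_distance(a, b) for b in hapaxes] for a in hapaxes]


# Toy inputs, invented for illustration only.
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
matrix = distance_matrix(texts)

# Fourteen authors give C(14, 2) unordered author pairs.
author_pairs = math.comb(14, 2)
```

In a fuller pipeline, a matrix like `matrix` would be eigendecomposed and the result passed to an SVM classifier, as the abstract describes; how exactly that is done is not specified here, so it is not sketched.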

Funder

Sinophone Borderlands: Interaction at the Edges

Publisher

Oxford University Press (OUP)

Subject

Computer Science Applications, Linguistics and Language, Language and Linguistics, Information Systems

References (52 articles)

1. Amelin. Patterning of writing style evolution by means of dynamic similarity. Pattern Recognition, 2018.

2. Andres. On de Saussure's principle of linearity and visualisation of language structures. Glottotheory, 2009.

3. Andres. Methodological note on the fractal analysis of texts. Journal of Quantitative Linguistics, 2012.

4. Altmann. Prolegomena to Menzerath’s Law. Glottometrika, 1980.

5. Baayen. Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 1996.
