Abstract
The aim of this study is to find parameters that can be used to classify relatively short texts, for example by author or genre. We review various known parameters and analyze to what extent they are useful for this purpose. We also suggest some improvements that require further verification. We calculate the parameter values at various points of a text, each point comprising the first N tokens (running words) counted from the beginning of the text. As parameters with prospects for author and/or language attribution we identify, in particular, the h-point scaling coefficient, Yule’s K, the relative repeat rate, and the fraction of dis legomena. These parameters behave quite stably with respect to N. Another set includes the scaling exponents of parameters with respect to N. We suggest modifications of Lambda and entropy that introduce logarithmic corrections in the form of powers of ln N. The results are applicable to texts of thousands to tens of thousands of words.
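As an illustration of evaluating parameters on the first N tokens, the following is a minimal Python sketch for two of the quantities named above. It assumes the common textbook form of Yule's K, K = 10^4 (Σ m² V_m − N) / N², where V_m is the number of word types occurring exactly m times, and it assumes that the fraction of dis legomena is taken relative to the vocabulary size, V_2 / V. The study's exact definitions and tokenization may differ, and the function name `prefix_stats` is hypothetical.

```python
from collections import Counter


def prefix_stats(tokens, n):
    """Return (Yule's K, dis legomena fraction) for the first n tokens.

    Assumed definitions (not confirmed by the source):
      K = 1e4 * (sum_m m^2 * V_m - N) / N^2
      dis legomena fraction = V_2 / V
    """
    prefix = tokens[:n]
    if not prefix:
        raise ValueError("need at least one token")
    freqs = Counter(prefix)             # word type -> frequency
    spectrum = Counter(freqs.values())  # frequency m -> number of types V_m
    total = len(prefix)                 # N, number of running words
    vocab = len(freqs)                  # V, number of distinct word types
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    yule_k = 1e4 * (s2 - total) / (total * total)
    dis_fraction = spectrum.get(2, 0) / vocab
    return yule_k, dis_fraction


if __name__ == "__main__":
    # Toy input; a real text would be tokenized into thousands of words.
    text = "the cat sat on the mat and the dog sat on the log".split()
    for n in (4, 8, len(text)):
        k, d = prefix_stats(text, n)
        print(n, round(k, 1), round(d, 3))
```

Tracking such values over increasing N is one simple way to check whether a parameter stays stable as the text prefix grows.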
Publisher
International Quantitative Linguistics Association
Subject
Applied Mathematics, Linguistics and Language, Language and Linguistics
Cited by
2 articles.