An Optimal Feature Set for Stylometry-based Style Change detection at Document and Sentence Level

Author:

Vivian Oloo 1, Lilian D. Wanzare 1, Calvins Otieno 1

Affiliation:

1. Department of Computer Science, Maseno University/Kisumu, Kenya

Abstract

Writing style change detection models aim to determine the number of authors of a document, whether or not its authors are known. Determining the exact number of contributing authors remains challenging, particularly when each author contributes a text as short as a sentence, because there is no standardized feature set able to discriminate between the works of different authors. Identifying the best feature set for every task in writing style change detection is therefore still an important problem. This paper sought to determine the best feature set for two such tasks: separating documents with several style changes (multi-authorship) from documents without any style change (single-authorship), and determining the number and location of style changes in multi-authored documents. We performed exploratory research on existing stylometric features to determine the best document level and sentence level features. Document level features were extracted and used to separate single-authored from multi-authored documents, while sentence level features were used to determine the number of style changes. To this end, we trained a random forest classifier to rank document level and sentence level features separately, and applied an ablation test on the top 15 sentence level features using the k-means clustering algorithm to confirm the effect of these features on model performance.
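The ablation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the k-means routine is a plain Lloyd's iteration with a deterministic farthest-point initialisation, the purity score is an assumed evaluation measure, the number of clusters is taken as known, and the feature values are invented toy numbers. The feature name `noisy_feature` is hypothetical; `personal_pronouns` and `colon_count` are borrowed from the feature names in this abstract.

```python
import math
from collections import Counter

def kmeans(points, k, iters=20):
    """Lloyd's k-means; returns one cluster index per point."""
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def purity(labels, authors):
    """Fraction of sentences that share their cluster's majority author."""
    by_cluster = {}
    for lab, author in zip(labels, authors):
        by_cluster.setdefault(lab, []).append(author)
    majority = sum(Counter(a).most_common(1)[0][1] for a in by_cluster.values())
    return majority / len(authors)

def ablation(rows, authors, names, k):
    """Score the full feature set, then each set with one feature left out."""
    baseline = purity(kmeans(rows, k), authors)
    scores = {}
    for i, name in enumerate(names):
        reduced = [row[:i] + row[i + 1:] for row in rows]
        scores[name] = purity(kmeans(reduced, k), authors)
    return baseline, scores

# Toy sentence-level feature vectors (invented values, two authors).
names = ["personal_pronouns", "colon_count", "noisy_feature"]
rows = [[5, 0, 0], [6, 0, 10], [5, 1, 0],    # author A
        [1, 3, 10], [0, 4, 0], [1, 3, 10]]   # author B
authors = ["A", "A", "A", "B", "B", "B"]

baseline, scores = ablation(rows, authors, names, k=2)
for name, score in scores.items():
    print(f"without {name}: purity {score:.2f} (baseline {baseline:.2f})")
```

In this toy setup, leaving out a feature that carries no author signal raises the score, while leaving out an informative feature does not. A leave-one-out comparison of this kind is one way a ranked list of features can be narrowed to a smaller set.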
The study found that the best document level feature set for separating documents with and without style changes was an ensemble of features led by the number of sentence repetitions (num_sentence_repetitions) as the highest-ranked feature, followed by 5-grams, 4-grams, Special_character, sentence_begin_lower, sentence_begin_upper, diversity, automated_readability_index, parenthesis_count, first_word_uppercase, lensear_write_formula, dale_chall_readability, difficult_words and type_token_ratio. These were the top ranked features in experiment one. The top fifteen sentence level features, ranked using the random forest classifier, were diversity, dale_chall_readability grade, check_available_vowel, flesch_kincaid grade, parenthesis_count, colon_count, verbs, bigrams, alphabets, personal pronouns, coordinating conjunctions, interjections, modals, type_token_ratio and punctuations_count. The optimal feature set for determining the number of style changes in a document was then selected from these fifteen features based on the results of the ablation study. It is an ensemble of six features: personal pronouns, check_available_vowel, punctuations_count, parenthesis_count, coordinating conjunctions and colon_count.
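As an illustration of what sentence level features of this kind look like in practice, the sketch below computes four of the named features for a single sentence. The abstract does not define the features, so the definitions here are assumptions: tokens are lowercased alphabetic words, type_token_ratio is unique tokens divided by total tokens, parenthesis_count counts both opening and closing parentheses, and punctuations_count counts ASCII punctuation characters. The function name `sentence_features` is hypothetical.

```python
import re
import string

def sentence_features(sentence):
    """Return a small dictionary of stylometric features for one sentence."""
    # Assumed tokenisation: lowercased runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {
        # Vocabulary richness: distinct tokens over all tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "colon_count": sentence.count(":"),
        "parenthesis_count": sentence.count("(") + sentence.count(")"),
        # Every ASCII punctuation character, including colons and parentheses.
        "punctuations_count": sum(ch in string.punctuation for ch in sentence),
    }

features = sentence_features("Note: the cat (the big cat) sat.")
print(features)
```

Computing one such dictionary per sentence yields the kind of feature vectors that a ranking or clustering step can then consume.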

Publisher

Technoscience Academy

Subject

General Medicine
