How Different Text-Preprocessing Techniques using the Bert Model Affect the Gender Profiling of Authors

Published: 2021-09-25
Container-title: Advances in Machine Learning

Author:

Alzahrani Esam, Jololian Leon

Abstract

Forensic author profiling plays an important role in indicating possible profiles for suspects. Among the many automated solutions recently proposed for author profiling, transfer learning outperforms many other state-of-the-art techniques in natural language processing. Nevertheless, this sophisticated technique has yet to be fully exploited for author profiling. At the same time, whereas current author-profiling methods, all largely based on feature engineering, have spawned significant variation in each model used, transfer learning usually requires preprocessed text to be fed into the model. We reviewed multiple references in the literature and determined the most common preprocessing techniques associated with profiling authors' gender. Considering the variations in potential preprocessing techniques, we conducted an experimental study that involved applying five such techniques to measure each technique's effect while using the BERT model, chosen for being one of the most-used stock pretrained models. We used the Hugging Face Transformers library to implement the code for each preprocessing case. In our five experiments, we found that BERT achieves the best accuracy in predicting the gender of the author when no preprocessing technique is applied. Our best case achieved 86.67% accuracy in predicting the gender of authors.
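
The experimental design described in the abstract (one preprocessing condition per run, compared against raw text) can be sketched roughly as follows. The abstract does not name the five techniques, so the steps here (lowercasing, punctuation stripping, stop-word removal, URL masking), the `VARIANTS` table, the `preprocess` helper, and the tiny stop-word list are all illustrative assumptions, not the authors' code. The sketch covers only the text-cleaning stage; in a setup like the one described, each output string would then be passed to a BERT tokenizer and classifier.

```python
import re
import string

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of", "in"}


def lowercase(text: str) -> str:
    """Fold all characters to lower case."""
    return text.lower()


def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text: str) -> str:
    """Drop whitespace-separated tokens found in STOP_WORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def mask_urls(text: str) -> str:
    """Replace each http(s) URL with the placeholder token 'URL'."""
    return re.sub(r"https?://\S+", "URL", text)


# One entry per experimental condition (assumed names);
# "none" is the raw-text baseline with no steps applied.
VARIANTS = {
    "none": [],
    "lowercase": [lowercase],
    "no_punctuation": [strip_punctuation],
    "no_stop_words": [remove_stop_words],
    "mask_urls": [mask_urls],
}


def preprocess(text: str, variant: str) -> str:
    """Apply the steps of the named variant in order."""
    for step in VARIANTS[variant]:
        text = step(text)
    return text


if __name__ == "__main__":
    sample = "The cat is at https://example.com now!"
    for name in VARIANTS:
        print(f"{name:15s} {preprocess(sample, name)}")
```

Keeping each condition as a named list of steps makes it simple to train and evaluate the same model once per variant and compare accuracies, with the `none` entry serving as the no-preprocessing baseline that the abstract reports as the best case.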

Publisher

Academy and Industry Research Collaboration Center (AIRCC)

Cited by 9 articles.

1. Empowering Indonesian internet users: An approach to counter online toxicity and enhance digital well-being; Intelligent Systems with Applications; 2024-06

2. Unsupervised Production Machinery Data Labeling Method Based on Natural Language Processing; 2024 International Russian Smart Industry Conference (SmartIndustryCon); 2024-03-25

3. Students’ Experiences and Challenges During the COVID-19 Pandemic: A Multi-method Exploration; Lecture Notes in Computer Science; 2024

4. Experiments on IndoBERT Implementation for Detecting Multi-Label Hate Speech with Data Resampling through Synonym Replacement Method; 2023 IEEE 8th International Conference on Recent Advances and Innovations in Engineering (ICRAIE); 2023-12-02

5. Hate Speech Detection using CNN and BiGRU with Attention Mechanism on Twitter; 2023 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT); 2023-11-23