Sampling Techniques to Overcome Class Imbalance in a Cyberbullying Context

Author:

Colton David,Hofmann Markus

Abstract

<div data-canvas-width="705.3003252350338">The majority of datasets suffer from class imbalance where samples of a dominant class significantly outnumber the samples available for the minority class that is to be detected. Prediction and classification machine learning models work best when there are roughly equal numbers of each class type. This paper explores sampling techniques that can be used to overcome this class imbalance problem in a cyberbullying context. A newly classified cyberbullying dataset, including detailed descriptions of the criteria used in its classification, was used to examine the feasibility of applying text mining techniques, to automate the detection of cyberbullying text when the dataset shows a significant class imbalance between the positive, cyberbullying, sample and the negative, not cyberbullying, samples. In this paper, we will investigate if oversampling the minority positive class or undersampling the majority negative class affects the performance of a prediction model. A compromise solution where the positive class is partially oversampled, and the negative class is partially undersampled is also examined. Although not strictly a class imbalance solution, sampling using the most frequently observed features was also explored.</div><p> </p>

Publisher

Universitat Politecnica de Valencia

Subject

General Earth and Planetary Sciences,General Environmental Science

Reference26 articles.

1. Cardie, Claire. 1997. "Improving minority class prediction using case-specific feature weights." Proceedings of the Fourteenth International Conference on Machine Learning. Morgan Kaufmann. 57-65.

2. Chan, Philip K., and Salvatore J. Stolfo. 1998. "Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection." In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. AAAI Press. 164-168.

3. Chawla, Nitesh V. and Bowyer, Kevin W. and Hall, Lawrence O. and Kegelmeyer, W. Philip. 2002. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research. 321-357. https://doi.org/10.1613/jair.953

4. Chen, Ying, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. "Detecting Offensive Language in Social Media to Protect Adolescent Online Safety." Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Confernece on Social Computing (SocialCom). IEEE. 71-80. https://doi.org/10.1109/SocialCom-PASSAT.2012.55

5. Cionnaith, Fiachra Ó. 2012. Third suicide in weeks linked to cyberbullying. Accessed 03 14, 2019. http://www.irishexaminer.com/ireland/third-suicide-in-weeks-linked-to-cyberbullying-212271.html.

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3