Influence of Pre-Processing Strategies on the Performance of ML Classifiers Exploiting TF-IDF and BOW Features

Author:

Pimpalkar Amit Purushottam, Retna Raj R. Jeberson

Abstract

Data analytics and its associated applications have recently become important fields of study. A current concern for researchers is the massive amount of data produced every minute as people constantly share thoughts and opinions about things associated with them. Social media data, however, is still unstructured, scattered and hard to handle, and needs a strong foundation before it can be used as valuable information on a particular topic. Processing such unstructured data is challenging because of noise, co-relevance, emoticons, folksonomies and slang, and therefore requires proper pre-processing before the right sentiments can be obtained. The datasets are extracted from Kaggle and Twitter, pre-processing is performed using NLTK and Scikit-learn, and feature selection and extraction are done for the Bag of Words (BOW) and Term Frequency (TF) and Inverse Document Frequency (IDF) schemes. For polarity identification, we evaluated five Machine Learning (ML) algorithms, namely Multinomial Naive Bayes (MNB), Logistic Regression (LR), Decision Trees (DT), XGBoost (XGB) and Support Vector Machines (SVM). We performed a comparative analysis of these algorithms to decide which works best for the given dataset in terms of recall, accuracy, F1-score and precision. We assess the effects of various pre-processing techniques on two datasets, one domain-specific and the other not. The SVM classifier outperformed the other classifiers, with 73.12% accuracy and 94.91% precision. The research also highlights that the selection and representation of features, along with various pre-processing techniques, have a positive impact on classification performance. Overall, the results indicate an improvement in sentiment classification, and pre-processing approaches improve the efficiency of the classifiers.
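To make the feature pipeline described in the abstract concrete, here is a minimal sketch of tweet pre-processing followed by BOW and TF-IDF feature extraction. It uses only the Python standard library rather than the NLTK and Scikit-learn tooling named in the abstract. The stop-word list, the cleaning rules, the example tweets and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` are all assumptions made for illustration, not details taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the paper's actual list is not given.
STOP_WORDS = {"a", "an", "and", "i", "is", "it", "of", "the", "this", "to"}


def preprocess(text):
    """Lowercase, strip URLs, @mentions and non-letters, then drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # remove URLs and mentions
    text = re.sub(r"[^a-z\s]", " ", text)           # keep letters and spaces
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(vectors):
    """Weight count vectors by term frequency times a smoothed IDF (assumed form)."""
    n_docs = len(vectors)
    n_terms = len(vectors[0]) if vectors else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in vectors if vec[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for vec in vectors:
        total = sum(vec) or 1  # avoid division by zero for empty documents
        weighted.append([(c / total) * w for c, w in zip(vec, idf)])
    return weighted


# Made-up example tweets, not drawn from the paper's datasets.
tweets = [
    "I love this phone! http://t.co/xyz",
    "@user The battery is bad, bad, bad :(",
]
docs = [preprocess(t) for t in tweets]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)

for term, column in zip(vocab, zip(*weights)):
    print(term, [round(w, 3) for w in column])
```

The count vectors are the BOW features and the weighted vectors are the TF-IDF features; either could then be passed to a classifier such as the MNB, LR, DT, XGB or SVM models the abstract compares.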

Publisher

Ediciones Universidad de Salamanca

Cited by 37 articles.

1. Empowering Indonesian internet users: An approach to counter online toxicity and enhance digital well-being;Intelligent Systems with Applications;2024-06

2. Big data in transportation: a systematic literature analysis and topic classification;Knowledge and Information Systems;2024-05-08

3. Transformer-Based Memes Generation Using Text and Image;Advances in Computational Intelligence and Robotics;2024-02-27

4. Sentiment Analysis Using an Ensemble Approach on Flipkart Societal Media Data;2024 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS);2024-02-24

5. Ensemble Learning Techniques for Classifying Stressed and Unstressed Textual Data;2024 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS);2024-02-24
