Influence of Pre-Processing Strategies on the Performance of ML Classifiers Exploiting TF-IDF and BOW Features

Published:2020-06-18 Issue:2 Volume:9 Page:49-68
ISSN:2255-2863
Container-title:ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal
Short-container-title:ADCAIJ

Author:

Pimpalkar Amit Purushottam,Retna Raj R. Jeberson

Abstract

Data analytics and its associated applications have recently become important fields of study. Researchers are now concerned with the massive amount of data produced every minute and second as people constantly share thoughts and opinions about things associated with them. Social media information, however, is still unstructured, scattered and hard to handle, and a strong foundation needs to be developed so that it can be used as valuable information on a particular topic. Processing such unstructured data is quite challenging because of noise, co-relevance, emoticons, folksonomies and slang, and it therefore requires proper pre-processing before the right sentiments can be obtained. The datasets are extracted from Kaggle and Twitter, pre-processing is performed using NLTK and Scikit-learn, and feature selection and extraction are carried out for the Bag of Words (BOW), Term Frequency (TF) and Inverse Document Frequency (IDF) schemes. For polarity identification, we evaluated five Machine Learning (ML) algorithms, viz. Multinomial Naive Bayes (MNB), Logistic Regression (LR), Decision Trees (DT), XGBoost (XGB) and Support Vector Machines (SVM). We performed a comparative analysis of these algorithms to decide which works best for the given dataset in terms of recall, accuracy, F1-score and precision. We assess the effects of various pre-processing techniques on two datasets, one domain-specific and the other not. The SVM classifier outperformed the other classifiers, with 73.12% accuracy and 94.91% precision. The research also highlights that the selection and representation of features, along with various pre-processing techniques, have a positive impact on classification performance. Overall, the results indicate an improvement in sentiment classification, and the pre-processing approaches improve the efficiency of the classifiers.
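The feature pipeline named in the abstract (pre-processing, then BOW and TF-IDF weighting) can be illustrated with a minimal sketch. This is not the authors' code: the toy tweets, the short stop-word list, the helper names `preprocess`, `bag_of_words` and `tf_idf`, and the plain `tf * ln(N / df)` weighting are all assumptions chosen so that the example runs with the Python standard library alone. The paper itself reports using NLTK and Scikit-learn, whose exact settings are not given here.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; the paper's Kaggle and Twitter data are not reproduced here.
TWEETS = [
    "Loving the new phone!!! :) http://example.com",
    "@shop the phone battery is terrible",
    "Battery life is great, loving it",
]

# A tiny illustrative stop-word list (an assumption; NLTK ships a much longer one).
STOP_WORDS = {"the", "is", "it", "a", "an", "and"}


def preprocess(text):
    """Lowercase, strip URLs and @mentions, keep alphabetic tokens, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    tokens = re.findall(r"[a-z]+", text)  # also discards emoticons and punctuation
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw-count vector per document."""
    vocab = sorted({t for doc in docs for t in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight each count as tf * idf, with tf = count / len(doc) and idf = ln(N / df)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each vocabulary term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    weights = []
    for doc, row in zip(docs, counts):
        length = len(doc) or 1  # guard against documents left empty by pre-processing
        weights.append(
            [(c / length) * math.log(n_docs / df[j]) for j, c in enumerate(row)]
        )
    return vocab, weights


docs = [preprocess(t) for t in TWEETS]
vocab, bow = bag_of_words(docs)
_, tfidf = tf_idf(docs)
```

In this sketch a term that appears in only one tweet (such as "new") receives a higher TF-IDF weight than one shared across tweets (such as "loving"), which is the effect the IDF factor is meant to have. Scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its values would differ from the ones computed here.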

Publisher

Ediciones Universidad de Salamanca

Cited by 37 articles.

1. Empowering Indonesian internet users: An approach to counter online toxicity and enhance digital well-being;Intelligent Systems with Applications;2024-06

2. Big data in transportation: a systematic literature analysis and topic classification;Knowledge and Information Systems;2024-05-08

3. Transformer-Based Memes Generation Using Text and Image;Advances in Computational Intelligence and Robotics;2024-02-27

4. Sentiment Analysis Using an Ensemble Approach on Flipkart Societal Media Data;2024 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS);2024-02-24

5. Ensemble Learning Techniques for Classifying Stressed and Unstressed Textual Data;2024 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS);2024-02-24