A New Big Data Feature Selection Approach for Text Classification

Published: 2021-04-19; Volume: 2021; Page: 1-10
ISSN: 1875-919X
Container-title: Scientific Programming
Language: en
Short-container-title: Scientific Programming

Author:

Amazal Houda¹, Kissi Mohamed¹

Affiliation:

1. Computer Science Laboratory, Faculty of Sciences and Technologies, University Hassan II Casablanca, Mohammedia, Morocco

Abstract

Feature selection (FS) is a fundamental task in text classification. Text feature selection aims to represent documents using only the most relevant features, which can reduce the size of datasets and improve the performance of machine learning algorithms. Many researchers have focused on developing efficient FS techniques. However, most of the proposed approaches are evaluated on small datasets and validated on single machines. As the dimensionality of textual data grows, traditional FS methods must be improved and parallelized to handle textual big data. This paper proposes a distributed approach to feature selection based on the mutual information (MI) method, which is widely applied in pattern recognition and machine learning. A drawback of MI is that it ignores the frequency of terms when selecting features. The proposal introduces a distributed FS method, namely Maximum Term Frequency-Mutual Information (MTF-MI), which combines term frequency and mutual information to improve the quality of the selected features. The proposed approach is implemented on Hadoop using the MapReduce programming model. The effectiveness of MTF-MI is demonstrated through several text classification experiments using the multinomial Naïve Bayes classifier on three datasets. The results show that the proposed MTF-MI method improves classification performance compared with four state-of-the-art methods in terms of the macro-F1 and micro-F1 measures.
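
The drawback of MI noted in the abstract can be illustrated with a minimal sketch. The code below assumes the document-count form of MI that is common in text feature selection, MI(t, c) = log(P(t, c) / (P(t) P(c))); the abstract does not give the paper's exact formula, and this is not the MTF-MI method itself. The function name `mi_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def mi_scores(docs, labels):
    """Score each (term, class) pair with document-count mutual information.

    Assumed formulation (not taken from the paper):
        MI(t, c) = log( N * df(t, c) / (df(t) * df(c)) )
    where df counts documents, so only term presence is used.
    """
    n = len(docs)
    class_count = Counter(labels)   # documents per class
    term_count = Counter()          # documents containing the term
    joint_count = Counter()         # documents of a class containing the term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):    # set(): repeated occurrences are discarded
            term_count[term] += 1
            joint_count[(term, label)] += 1
    return {
        (term, label): math.log(joint * n / (term_count[term] * class_count[label]))
        for (term, label), joint in joint_count.items()
    }

# Hypothetical toy corpus: "goal" occurs three times in the first document.
docs = [
    ["goal", "goal", "goal", "match"],
    ["goal", "match"],
    ["vote", "match"],
    ["vote", "law"],
]
labels = ["sport", "sport", "politics", "politics"]
scores = mi_scores(docs, labels)

# Collapsing repeated terms leaves every score unchanged, because the
# within-document frequency never enters the computation.
presence_only = [sorted(set(tokens)) for tokens in docs]
print(mi_scores(presence_only, labels) == scores)
```

In this corpus, "law" (one occurrence in one document) receives the same score for "politics" as "goal" (four occurrences across two documents) receives for "sport". This is the kind of frequency-blind ranking that a term-frequency component is intended to correct.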

Publisher

Hindawi Limited

Subject

Computer Science Applications, Software

Link

http://downloads.hindawi.com/journals/sp/2021/6645345.pdf

Cited by 10 articles.

1. Comparative performance analysis of Boruta, SHAP, and Borutashap for disease diagnosis: A study with multiple machine learning algorithms;Network: Computation in Neural Systems;2024-03-21

2. A Systematic Review of the Sarcasm Detection in the Twitter Dataset;International Journal of Recent Technology and Engineering (IJRTE);2024-01-30

3. An integrated approach for depression diagnosis using 3S feature embeddings and G-BLS with T-pHBGO optimizer;Expert Systems with Applications;2024-01

4. Research on Image Semantic Segmentation Based on Hybrid Cascade Feature Fusion and Detailed Attention Mechanism;IEEE Access;2024

5. Filter feature selection methods for text classification: a review;Multimedia Tools and Applications;2023-05-11