A new neutrosophic TF-IDF term weighting for text mining tasks: text classification use case

Published:2021-04-08 Issue:3 Volume:17 Page:229-249
ISSN:1744-0084
Container-title:International Journal of Web Information Systems
language:en
Short-container-title:IJWIS

Author:

Bounabi Mariem, Elmoutaouakil Karim, Satori Khalid

Abstract

Purpose

This paper presents a new term weighting approach for text classification as a text mining task. The proposed method, neutrosophic term frequency-inverse document frequency (NTF-IDF), extends the popular fuzzy TF-IDF (FTF-IDF) and uses neutrosophic reasoning to analyze and generate weights for terms in natural languages. The paper also presents a comparative study of FTF-IDF and NTF-IDF and their impact on different machine learning (ML) classifiers for document categorization.

Design/methodology/approach

After the textual data are preprocessed, NTF-IDF applies a neutrosophic inference system (NIS) to produce weights for the terms representing a document. Using the local frequency TF, the global frequency IDF and the text length N as NIS inputs, the study generates two neutrosophic weights for a given term. The first measures the term's degree of relevance, and the second measures its degree of ambiguity. Next, the Zhang combination function combines the two neutrosophic outputs into the final term weight, which is inserted into the document's representative vector. To analyze the impact of NTF-IDF on the classification phase, the study uses a set of ML algorithms.

Findings

By exploiting the characteristics of neutrosophic logic (NL), the authors were able to study the ambiguity of terms and their degree of relevance in representing a document. The choice of NL proved effective in defining significant text vectorization weights, especially for text classification tasks. The experiments demonstrate that the new method positively impacts categorization. Moreover, the adopted system's recognition rate is higher than 91%, an accuracy score not attained using FTF-IDF. In addition, on benchmark data sets from different text mining fields and with several ML classifiers, such as SVM and Feed-Forward Network, applying the proposed NTF-IDF term scores improves accuracy by 10%.

Originality/value

The novelty of this paper lies in two aspects. First, it introduces a new term weighting method that uses term frequencies as components to define the relevance and the ambiguity of a term. Second, it applies NL to infer weights, which is an original model in this paper and aims to correct the shortcomings of FTF-IDF, which relies on fuzzy logic and inherits its drawbacks. The introduced technique was combined with different ML models to improve the accuracy and relevance of the feature vectors fed to the classification mechanism.
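As a rough illustration of the three NIS inputs named in the methodology (local frequency TF, global frequency IDF and text length N), the sketch below computes them for each term of a small corpus. It is not the authors' implementation. The function name `nis_inputs`, the whitespace tokenizer, the length-normalized TF and the unsmoothed logarithmic IDF are all assumptions made for illustration. The neutrosophic inference step and the Zhang combination function are not specified in this abstract, so they are left out.

```python
import math
from collections import Counter


def nis_inputs(docs):
    """Return, for each document, a dict mapping term -> (tf, idf, n).

    Hypothetical helper: tf is the term count divided by document length,
    idf is log(number of documents / document frequency), and n is the
    document length in tokens. These are common textbook choices, assumed
    here rather than taken from the paper.
    """
    # Assumed preprocessing: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    features = []
    for tokens in tokenized:
        n = len(tokens)
        counts = Counter(tokens)
        features.append({
            term: (count / n, math.log(n_docs / df[term]), n)
            for term, count in counts.items()
        })
    return features


docs = ["the cat sat on the mat", "the dog barked", "a cat and a dog"]
feats = nis_inputs(docs)
tf, idf, n = feats[0]["mat"]
print(f"tf={tf:.3f} idf={idf:.3f} n={n}")
```

Each `(tf, idf, n)` triple is what a system like the one described would pass to its inference stage, which would then return a relevance weight and an ambiguity weight for the term.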

Publisher

Emerald

Subject

Computer Networks and Communications, Information Systems

References: 41 articles.

1. A comparison of supervised classification methods for a statistical set of features: application: amazigh OCR,2015

2. An information-theoretic perspective of tf–idf measures;Information Processing and Management,2003

3. An improved clustering method for text documents using neutrosophic logic,2017

4. Neutrosophic classifier: an extension of fuzzy classifer;Applied Soft Computing,2013

5. A comparison of text classification methods method of weighted terms selected by different stemming techniques,2017

Cited by 11 articles.

1. Personalized new media marketing recommendation system based on TF-IDF algorithm optimizing LSTM-TC model;Service Oriented Computing and Applications;2024-08-06

2. Chinese and English text classification techniques incorporating CHI feature selection for ELT cloud classroom;Open Computer Science;2024-01-01

3. Classifying Evaluation Method of Innovative Teachers’ Teaching Ability Based on Multi Source Data Fusion;Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering;2024

4. Data Mining Technology Helps Digital Teaching and Learning of English Majors in Colleges and Universities;Applied Mathematics and Nonlinear Sciences;2023-12-05

5. K-Means and Feature Selection Mechanism to Improve Performance of Clustering User Stories in Agile Development;2023 International Conference on Modeling & E-Information Research, Artificial Learning and Digital Applications (ICMERALDA);2023-11-24