Affiliation:
1. Computer Engineering Department, Military Technical College, Cairo, Egypt
Abstract
Extracting the feature vector used in mining tasks (classification, clustering, etc.) is considered the most important step in enhancing text processing capabilities. This paper proposes a novel approach to building the feature vector used in web text document classification, one that adds semantics to the generated feature vector. The approach exploits the hierarchical structure of the WordNet ontology in two ways. First, it eliminates meaningless words, that is, words with no semantic relation to any WordNet lexical category, from the generated feature vector; this reduces the feature vector size without losing information about the text. Second, it enriches the feature vector by concatenating each remaining word with its corresponding WordNet lexical category. For mining tasks, the Vector Space Model (VSM) is used to represent text documents, and Term Frequency-Inverse Document Frequency (TF-IDF) is used as the term weighting technique. The proposed ontology-based approach was evaluated against the Principal Component Analysis (PCA) approach and against an ontology-based reduction technique that does not add semantics to the generated feature vector, in several experiments with five different classifiers (SVM, JRip, J48, Naive Bayes, and kNN). The experimental results show that the authors' proposed approach outperforms the other, traditional approaches in classification accuracy, F-measure, precision, and recall.
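The filtering and enrichment steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `TOY_LEXICON` dictionary is a hypothetical stand-in for a real WordNet lookup (with NLTK one could, for instance, take `wordnet.synsets(word)[0].lexname()`), the `word|category` concatenation format is an assumption, and the TF-IDF variant shown (relative term frequency times the natural log of inverse document frequency) is one common choice rather than necessarily the weighting used in the paper.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet: maps a word to a lexical category.
# Words missing from this table are treated as having no semantic relation.
TOY_LEXICON = {
    "bank": "noun.group",
    "loan": "noun.possession",
    "river": "noun.object",
    "pay": "verb.possession",
    "flow": "verb.motion",
}


def semantic_tokens(text):
    """Drop words without a lexical category; tag the rest as word|category."""
    tokens = []
    for word in text.lower().split():
        category = TOY_LEXICON.get(word)
        if category is not None:
            tokens.append(f"{word}|{category}")
    return tokens


def tfidf_vectors(docs):
    """Build a Vector Space Model with TF-IDF weights over enriched tokens."""
    tokenized = [semantic_tokens(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(tf[term] / total) * math.log(n_docs / df[term]) for term in vocab]
        )
    return vocab, vectors


docs = [
    "the bank will pay the loan",
    "the river will flow past the bank",
]
vocab, vectors = tfidf_vectors(docs)
for term, weights in zip(vocab, zip(*vectors)):
    print(term, [round(w, 3) for w in weights])
```

In this toy run, words such as "the" and "will" are dropped because they have no entry in the lexicon, so the vocabulary shrinks to five enriched terms. The term shared by both documents receives a weight of zero, since its inverse document frequency is log(1).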
Subject
Artificial Intelligence, Computer Graphics and Computer-Aided Design, Computer Networks and Communications, Computer Science Applications, Software
Cited by
10 articles.