Affiliation:
1. Celal Bayar University, Turkey
Abstract
Web page classification is an important research direction in web mining. The abundance of data available on the web makes it essential to develop efficient and robust models for web mining tasks. Web page classification is the process of assigning a web page to one of a set of predefined categories, using a model trained on labelled data. It supports several other web mining tasks, such as focused web crawling, web link analysis and contextual advertising. Machine learning and data mining methods have been successfully applied to several web mining tasks, including web page classification. Multiple classifier systems are a promising research direction in machine learning; they combine several classifiers, varying the base classifiers and/or the dataset distributions, so that more robust classification models can be built. This paper presents a comparative analysis of four feature selection methods (correlation-based, consistency-based, information gain and chi-square-based feature selection) and four ensemble learning methods (Boosting, Bagging, Dagging and Random Subspace) built on four base learners (naive Bayes, the K-nearest neighbour algorithm, the C4.5 algorithm and the FURIA algorithm). The article examines the predictive performance of ensemble methods for web page classification. The experimental results indicate that feature selection and ensemble learning can enhance the predictive performance of classifiers in web page classification. For the DMOZ-50 dataset, the highest average predictive performance (88.1%) is obtained by combining consistency-based feature selection with the AdaBoost and naive Bayes algorithms, which is a promising result for web page classification. The experimental results also indicate that the Bagging and Random Subspace ensemble methods, and the correlation-based and consistency-based feature selection methods, obtain better results in terms of accuracy rates.
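To make the kind of pipeline compared in the abstract concrete, the following is a minimal sketch of one of the listed combinations: information gain feature selection, followed by Bagging over naive Bayes base learners. It is not the authors' implementation. The function names, the toy documents, the number of selected features and the number of bagged models are all assumptions chosen for illustration, and the sketch uses only the Python standard library.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the page'."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    return (entropy(labels)
            - (len(present) / n) * entropy(present)
            - (len(absent) / n) * entropy(absent))

def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return set(ranked[:k])

def train_nb(docs, labels, features):
    """Multinomial naive Bayes with Laplace smoothing on the selected terms."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(t for t in d if t in features)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(features)
        loglik[c] = {t: math.log((counts[c][t] + 1) / total) for t in features}
    return prior, loglik

def predict_nb(model, doc):
    """Return the class with the highest log posterior for one page."""
    prior, loglik = model
    return max(sorted(prior),
               key=lambda c: prior[c] + sum(loglik[c][t] for t in doc
                                            if t in loglik[c]))

def train_bagging(docs, labels, features, n_models=11, seed=0):
    """Bagging: train each base learner on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(docs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(train_nb([docs[i] for i in idx],
                               [labels[i] for i in idx], features))
    return models

def predict_bagging(models, doc):
    """Combine the base learners by majority vote."""
    votes = Counter(predict_nb(m, doc) for m in models)
    return votes.most_common(1)[0][0]

# Toy "web pages" (invented for illustration), already tokenised.
docs = [
    "football match score team".split(),
    "team wins league match".split(),
    "player scores goal match".split(),
    "stock market shares price".split(),
    "market price falls shares".split(),
    "bank shares market profit".split(),
]
labels = ["sports"] * 3 + ["finance"] * 3

features = select_features(docs, labels, k=4)
models = train_bagging(docs, labels, features)
print(predict_bagging(models, "market shares rise".split()))
```

Swapping `select_features` for a different ranking criterion, or `train_bagging` for a different resampling scheme, gives the other combinations in spirit; a real experiment would of course use a labelled corpus and a proper evaluation protocol rather than six toy documents.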
Subject
Library and Information Sciences, Information Systems
Cited by
102 articles.