A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification

Published: 2011-07 Issue: 3 Volume: 5 Page: 1-29
ISSN: 1559-1131
Container-title: ACM Transactions on the Web
Language: en
Short-container-title: ACM Trans. Web

Author:

Baykan Eda¹, Henzinger Monika², Marian Ludmila³, Weber Ingmar⁴

Affiliation:

1. Izmir University

2. University of Vienna

3. CERN

4. Yahoo! Research

Abstract

Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40.
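To make the abstract's terms concrete, the following is a minimal Python sketch of how features might be generated from a URL alone, and of how an F-measure follows from precision and recall. It is an illustration under stated assumptions, not the authors' implementation: the function names (`url_tokens`, `char_ngrams`, `f_measure`), the stop list, the n-gram length of 4, and the example URL are all hypothetical choices that this record does not specify.

```python
import re
from urllib.parse import urlparse


def url_tokens(url: str) -> list[str]:
    """Split a URL into lowercase alphabetic tokens (illustrative choice)."""
    parsed = urlparse(url)
    text = f"{parsed.netloc} {parsed.path} {parsed.query}".lower()
    # Assumed stop list: very common URL parts that carry no topical signal.
    stop = {"www", "http", "https", "com", "html", "htm", "index"}
    return [t for t in re.findall(r"[a-z]+", text) if t not in stop]


def char_ngrams(token: str, n: int = 4) -> list[str]:
    """Return all character n-grams of a token (the token itself if shorter)."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical URL used only for demonstration.
    tokens = url_tokens("http://www.example.com/sports/football-news.html")
    print(tokens)
    print(char_ngrams("sports"))
    print(f_measure(90, 35))
```

With these definitions, a high-precision operating point such as precision 90 and recall 35 gives an F-measure of about 50, which shows why pushing precision above 90 comes at a visible cost relative to the typical F-measure range of 80 to 85.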

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications

Link

https://dl.acm.org/doi/pdf/10.1145/1993053.1993057

Cited by 47 articles.

1. Diversity-aware strategies for static index pruning;Information Processing & Management;2024-09

2. Understanding user intent modeling for conversational recommender systems: a systematic literature review;User Modeling and User-Adapted Interaction;2024-06-06

3. Geo-Insurance: Improving Big Data Challenges in the Context of Insurance Services Using a Geographical Information System (GIS);Human Behavior and Emerging Technologies;2024-01

4. A Novel Approach for Semi-supervised Learning: Incremental Parallel Training with Cross-Validation (IPT-CV);Arabian Journal for Science and Engineering;2022-11-23

5. WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS;Uludağ University Journal of The Faculty of Engineering;2022-03-16