A Comprehensive Study of Techniques for URL-Based Web Page Language Classification

Published:2013-03 Issue:1 Volume:7 Page:1-37
ISSN:1559-1131
Container-title:ACM Transactions on the Web
language:en
Short-container-title:ACM Trans. Web

Author:

Baykan Eda¹,Henzinger Monika²,Weber Ingmar³

Affiliation:

1. Izmir University

2. University of Vienna

3. Yahoo! Research Barcelona

Abstract

Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or when downloading the content is a waste of bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms that are widely applied for text classification and state-of-the-art algorithms for language identification of text. As features we used words, n-grams of various sizes, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. With the best performing classifiers, we obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98). We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract, and in the second case, downloading pages of the “wrong” language constitutes a waste of bandwidth. In both settings the best classifiers have a high accuracy, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
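To make the ideas in the abstract concrete, the following is a minimal Python sketch of two of the ingredients it names: the country code top-level domain baseline and word / character n-gram features extracted from a URL. It is not the authors' implementation. The ccTLD-to-language table, the function names, the choice of trigrams, and the example URL are all illustrative assumptions, since this record does not give the paper's actual feature definitions.

```python
import re
from urllib.parse import urlparse

# Hypothetical ccTLD-to-language table covering the five languages named in
# the abstract; the table used in the paper is not reproduced in this record.
CCTLD_TO_LANGUAGE = {
    "de": "German", "at": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English", "us": "English",
}


def cctld_baseline(url):
    """Guess the language from the last label of the host name, or None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_LANGUAGE.get(tld)


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (the "words" feature)."""
    return re.findall(r"[a-z]+", url.lower())


def char_ngrams(token, n=3):
    """Return all character n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def url_features(url, n=3):
    """Collect word tokens and their character n-grams as one feature list."""
    features = []
    for token in url_tokens(url):
        features.append(token)
        features.extend(char_ngrams(token, n))
    return features
```

A feature list such as the one returned by `url_features("http://www.beispiel.de/nachrichten/wetter")` could then be fed to any standard text classifier, while `cctld_baseline` on the same URL relies only on the `.de` suffix. The baseline returns `None` for generic domains such as `.com`, which illustrates why a ccTLD lookup on its own cannot cover every URL.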

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications

Link

https://dl.acm.org/doi/pdf/10.1145/2435215.2435218

Reference 41 articles.

1. Classifying Documents According to Locational Relevance

2. Web page language identification based on URLs

Cited by 10 articles.

1. Homoglyph Attack Detection Model Using Machine Learning and Hash Function;Journal of Sensor and Actuator Networks;2022-09-16

2. WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS;Uludağ University Journal of The Faculty of Engineering;2022-03-16

3. Incremental community discovery via latent network representation and probabilistic inference;Knowledge and Information Systems;2019-11-15

4. An Automatic and Scalable Application Crawler for Large-Scale Mobile Internet Content Retrieval;KSII Transactions on Internet and Information Systems;2018-10-31

5. DistrustRank;Proceedings of the 10th ACM Conference on Web Science;2018-05-15