Affiliation:
1. Federal Institute of Rio Grande do Sul, Ibirubá, Brazil
2. Federal University of Santa Catarina, Florianópolis, Brazil
3. Federal University of Rio Grande do Sul, Porto Alegre, Brazil
Abstract
The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in the entity-pages can be used for many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called
SSUP
, which discovers the entity-pages on the websites. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, and thus the efficacy of the entity-page discovery task is increased.
SSUP
determines the similarity thresholds on each website without human intervention. We carried out experiments on a dataset with different real-world websites and a wide range of entity types.
SSUP
achieved a 95% rate of precision and 85% recall rate. Our method was compared with two state-of-the-art methods and outperformed them with a precision gain between 51% and 66%.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications
Cited by
2 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献