Combining URL and HTML Features for Entity Discovery in the Web

Published:2019-12-20 Issue:4 Volume:13 Page:1-27
ISSN:1559-1131
Container-title:ACM Transactions on the Web
language:en
Short-container-title:ACM Trans. Web

Author:

Manica Edimar¹, Dorneles Carina Friedrich², Galante Renata³

Affiliation:

1. Federal Institute of Rio Grande do Sul, Ibirubá, Brazil

2. Federal University of Santa Catarina, Florianópolis, Brazil

3. Federal University of Rio Grande do Sul, Porto Alegre, Brazil

Abstract

The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in the entity-pages can be used by many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called SSUP, which discovers the entity-pages on the websites. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, and thus the efficacy of the entity-page discovery task is increased. SSUP determines the similarity thresholds on each website without human intervention. We carried out experiments on a dataset with different real-world websites and a wide range of entity types. SSUP achieved 95% precision and 85% recall. Our method was compared with two state-of-the-art methods and outperformed them with a precision gain between 51% and 66%.

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications

Link

https://dl.acm.org/doi/pdf/10.1145/3365574

Cited by 2 articles.

1. Scraping Relevant Images from Web Pages without Download;ACM Transactions on the Web;2023-10-11

2. Crawler by Contextual Inference;SN Computer Science;2021-04-16