Classification of Layout vs. Relational Tables on the Web: Machine Learning with Rendered Pages

Published:2022-12-20 Issue:1 Volume:17 Page:1-23
ISSN:1559-1131
Container-title:ACM Transactions on the Web
language:en
Short-container-title:ACM Trans. Web

Author:

Haider Waqar¹, Yesilada Yeliz¹

Affiliation:

1. Middle East Technical University Northern Cyprus Campus, Mersin, Turkey

Abstract

Table mining on the web is an open problem, and none of the previously proposed techniques provides a complete solution. Most research focuses on the structure of the HTML document, but because of the nature and structure of the web, detecting relational tables remains challenging. The Web Content Accessibility Guidelines (WCAG) include a wide range of recommendations for making tables accessible, but our previous work shows that these recommendations are not followed; as a result, tables remain inaccessible both to disabled people and to automated processing. We propose a new approach to table mining that looks not at the HTML structure but at the pages as rendered by the browser. The first task in table mining on the web is to distinguish relational tables from layout tables, and we propose two alternative approaches for that task. We first introduce our dataset, which includes 725 web pages with 9,957 extracted tables. Our first approach extracts features from a page after it has been rendered by the browser and then applies several machine learning algorithms to classify tables as layout or relational. The best result is obtained with Random Forest, with an accuracy of 97.2% (F1-score: 0.955) under 10-fold cross-validation. Our second approach classifies images of tables taken from the same sources using a Convolutional Neural Network (CNN), which gives an accuracy of 95% (F1-score: 0.95). Our work shows that the web’s true essence emerges after it goes through a browser: using the rendered pages and tables, classification is more accurate than in the literature, which paves the way for making tables more accessible.
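The evaluation protocol named in the abstract (10-fold cross-validation, reported as accuracy and F1-score) can be illustrated with a minimal sketch. This is not the authors' code: the function names `k_fold_indices` and `accuracy_and_f1` are hypothetical, no classifier is included, and treating "relational" as the positive class (label 1) is an assumption, since the abstract does not say which class the F1-score refers to.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k (train, test) index pairs.

    Sample i is assigned to fold i % k, so fold sizes differ by at most one.
    Each sample is held out for testing exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        splits.append((train, test))
    return splits


def accuracy_and_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Return (accuracy, F1-score of the `positive` class).

    Assumption: `positive` marks relational tables; layout tables are 0.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1
```

On an imbalanced label set the two metrics can diverge, which is one reason both are usually reported: in a made-up example with 2 relational and 8 layout tables, missing one relational table still leaves accuracy at 0.9 while the F1-score of the relational class falls to about 0.67.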

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications

Link

https://dl.acm.org/doi/pdf/10.1145/3555349

Cited by 4 articles.

1. MedT2T: An adaptive pointer constrain generating method for a new medical text-to-table task;Future Generation Computer Systems;2024-12

2. Automatic Regular Expression Generation for Extracting Relevant Image Data From Web Pages Using Genetic Algorithms;IEEE Access;2024

3. Coordination analysis of layout and visual color difference in responsive web design based on PS software;Applied Mathematics and Nonlinear Sciences;2024-01-01

4. Scraping Relevant Images from Web Pages without Download;ACM Transactions on the Web;2023-10-11