Affiliation:
1. Center for Automation Research, Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland, College Park, MD
Abstract
Tabular data is an abundant source of information on the Web, but it remains mostly isolated from the Web's interconnections because tables lack links and computer-accessible descriptions of their structure. In other words, the schemas of these tables (attribute names, values, data types, etc.) are not explicitly stored as table metadata. Consequently, the structure that these tables contain is not accessible to the crawlers that power search engines and thus not accessible to user search queries. We address this lack of structure with a new method that leverages the principles of table construction to extract table schemas. The method discovers the schema by which a table is constructed by harnessing the similarities and differences of nearby table rows through a novel set of features and a feature processing scheme. The schemas are determined using a classification technique based on conditional random fields in combination with a novel feature encoding method called logarithmic binning, which is specifically designed for the data table extraction task. Our method provides considerable improvement over the well-known WebTables schema extraction method. In contrast with previous work that focuses on extracting individual relations, our method excels at correctly interpreting full tables, and can therefore handle general tables such as those found in spreadsheets, instead of being restricted to HTML tables as the WebTables method is. We also extract additional schema characteristics, such as row groupings, which are important for supporting information retrieval tasks on tabular data.
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development
Cited by
61 articles.
1. Visualizing multilayer spatiotemporal epidemiological data with animated geocircles;Journal of the American Medical Informatics Association;2024-09-03
2. Gen-T: Table Reclamation in Data Lakes;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13
3. Olio: A Semantic Search Interface for Data Repositories;Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology;2023-10-29
4. MORPHER: Structural Transformation of Ill-formed Rows;Proceedings of the 32nd ACM International Conference on Information and Knowledge Management;2023-10-21
5. SANTOS: Relationship-based Semantic Table Union Search;Proceedings of the ACM on Management of Data;2023-05-26