Affiliation:
1. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, P. R. China
2. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, P. R. China
Abstract
Materials websites hold rich data resources, most of which are presented as HTML tables. However, because HTML tables are semi-structured, it is difficult to distinguish attributes from values. Identifying attributes in HTML tables is therefore the key issue in information acquisition. In this paper, we propose a method based on sibling comparison for extracting materials knowledge from HTML tables. It consists of three steps: acquiring sibling tables, identifying the table pattern, and extracting the table data. We show how to use the [Formula: see text]-measure to find appropriate thresholds for matching tables from materials websites when acquiring sibling tables. Further, we propose a strategy named FRFC (First Row matching and First Column matching) to distinguish attributes from values, so that the table pattern can be identified. The data in each HTML table is then extracted according to its table pattern and mapped to a predefined schema, which facilitates the population of a materials ontology. The proposed approach is applicable where an attribute in the table may span multiple cells and where many attributes match across sibling tables. Using FRFC to identify table patterns, we achieve the desired accuracy ([Formula: see text]%). Extraction time does not increase significantly as the number of documents and the number of cells per table grow, so the approach is effective for processing a large number of documents. A prototype named MTES is developed to demonstrate the effectiveness of the proposed approach.
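The abstract gives no algorithmic detail for FRFC, so the following is only a minimal Python sketch of the general idea of sibling comparison, not the authors' implementation. It assumes that two sibling tables (tables generated from the same page template) share their attribute cells but differ in their value cells, so whichever of the first row or first column matches across siblings is taken to hold the attributes. The function names, the pattern labels, and the 0.8 threshold are hypothetical; the paper tunes its matching thresholds empirically.

```python
from typing import Dict, List

Table = List[List[str]]  # a table as rows of cell text


def match_ratio(a: List[str], b: List[str]) -> float:
    """Fraction of positions whose normalized cell text is identical."""
    n = max(len(a), len(b))
    if n == 0:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x.strip().lower() == y.strip().lower())
    return same / n


def first_column(table: Table) -> List[str]:
    """First cell of every non-empty row."""
    return [row[0] for row in table if row]


def identify_pattern(table: Table, sibling: Table, threshold: float = 0.8) -> str:
    """Guess where the attributes live by comparing a table with a sibling.

    The threshold is an arbitrary placeholder (assumption), not a value
    taken from the paper.
    """
    row_score = match_ratio(table[0], sibling[0]) if table and sibling else 0.0
    col_score = match_ratio(first_column(table), first_column(sibling))
    if row_score >= threshold and row_score >= col_score:
        return "attributes_in_first_row"
    if col_score >= threshold:
        return "attributes_in_first_column"
    return "unknown"


def extract_records(table: Table, pattern: str) -> List[Dict[str, object]]:
    """Turn a table into attribute/value records according to its pattern."""
    if pattern == "attributes_in_first_row":
        header = table[0]
        return [dict(zip(header, row)) for row in table[1:]]
    if pattern == "attributes_in_first_column":
        # One record per table; each attribute maps to the rest of its row.
        return [{row[0]: row[1:] for row in table if row}]
    return []
```

When both the first row and the first column match equally well, this sketch prefers the first row. That tie-break is a simplification of this example, and it does not handle attributes that span multiple cells, which the paper explicitly covers.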
Publisher
World Scientific Pub Co Pte Lt
Subject
Artificial Intelligence, Computer Graphics and Computer-Aided Design, Computer Networks and Communications, Software
Cited by
3 articles.
1. Transforming a Nonstandard Table into Formalized Tables;2017 14th Web Information Systems and Applications Conference (WISA);2017-11
2. Comparative Search of Entities;International Journal of Software Engineering and Knowledge Engineering;2017-10
3. FeRe: Exploiting influence of multi-dimensional features resided in news domain for recommendation;Information Processing & Management;2017-09