A Method of Web Information Extraction Based on Building Different Sub Trees

Published: 2013-05 Volume: 694-697 Pages: 2513-2521
ISSN: 1662-8985
Container-title: Advanced Materials Research
Short-container-title: AMR

Author:

Wang Yuan Long¹, Jiang Hong², Bing Zhao Hong¹, Zhang Li¹

Affiliation:

1. Liaoning Technical University

2. Liaoning John hon Construction Engineering Co., Ltd.

Abstract

When extracting Web information, most existing approaches mix the structural tags of the DOM tree with the text content. To address this problem, we propose a method for automatic Web information extraction. First, we obtain a set of DOM subtrees by partitioning the DOM tree of the Web page. Second, we assign a weight to each node of every DOM subtree using the weighting method proposed in this paper. Based on these weights, we obtain the set of different subtrees by comparing the DOM subtrees of two pages that come from the same data source and belong to the same category. Third, we locate the data zone that contains the information to be extracted by computing the similarity of every pair of DOM subtrees in the set of different subtrees. Finally, the node path of every DOM subtree in the data zone is taken as an extraction rule, and these rules are used to automatically extract information from new Web pages of the same category. The experiments show that the method achieves higher precision and recall. It also reduces the time users spend filtering information.
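
The abstract does not give the weighting scheme or the similarity measure, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes well-formed markup with explicit closing tags, and it swaps the paper's node weights and subtree similarity for a plain text comparison: nodes whose text differs between two pages of the same template are treated as the data zone, and their node paths become the extraction rules. All names (`Node`, `TreeBuilder`, `differing_paths`, `extract`) and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class Node:
    """A DOM-like node that keeps only its tag, parent, children and own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def path(self):
        """Return an XPath-like node path such as /html[1]/body[1]/div[1]."""
        parts = []
        node = self
        while node.parent is not None:
            same_tag = [c for c in node.parent.children if c.tag == node.tag]
            parts.append(f"{node.tag}[{same_tag.index(node) + 1}]")
            node = node.parent
        return "/" + "/".join(reversed(parts))


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_by_path(root):
    """Map the node path of every text-bearing node to its text."""
    found = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.text:
            found[node.path()] = node.text
        stack.extend(node.children)
    return found


def differing_paths(html_a, html_b):
    """Paths present in both pages whose text differs (stand-in for the data zone)."""
    a, b = text_by_path(parse(html_a)), text_by_path(parse(html_b))
    return sorted(p for p in a.keys() & b.keys() if a[p] != b[p])


def extract(html, rules):
    """Apply node-path rules to a new page of the same category."""
    texts = text_by_path(parse(html))
    return {p: texts.get(p) for p in rules}


# Hypothetical pages that share one template.
TEMPLATE = "<html><body><h1>Shop</h1><div><span>{}</span><span>{}</span></div></body></html>"
page_a = TEMPLATE.format("Red mug", "$5")
page_b = TEMPLATE.format("Blue cup", "$7")
page_c = TEMPLATE.format("Green bowl", "$9")

rules = differing_paths(page_a, page_b)
record = extract(page_c, rules)
```

Here the shared `h1` text is identical on both pages and is filtered out, while the two `span` paths are kept as rules and then reused on `page_c`. A fuller implementation would need the paper's node weights and pairwise subtree similarity to handle pages whose structure, not just text, varies.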

Publisher

Trans Tech Publications, Ltd.

Subject

General Engineering

Link

https://www.scientific.net/AMR.694-697.2513.pdf

References: 10 articles.

1. Xiangwen Ji, Jianping Zeng, Shiyong Zhang, Chengong Wu. Tag tree template for Web information and schema extraction [J]. Expert Systems with Applications, (37), (2010), pp.8492-8498.

2. Califf M, Mooney R. Relational learning of pattern match rules for information extraction [C] /Proc of the 16th National Conf on Artificial Intelligence and 11th Conf on Innovative Applications of Artificial Intelligence. Menlo Park, CA: AAAI, (1999), pp.328-334.

3. Muslea I, Minton S, Knoblock C. A hierarchical approach to wrapper induction [C] /Proc of the 3rd Conf on Autonomous Agents. New York: ACM, (1999), pp.190-197.

4. Wei Liu, Xiaofeng Meng, Weiyi Meng. Vision-based Web data records extraction [C] /Proc of the 9th SIGMOD Int Workshop on Web and Database. New York: ACM, (2006), pp.20-25.

5. Giuseppe Della Penna, Daniele Magazzeni, Sergio Orefice. Visual extraction of information from web pages [J]. Journal of Visual Languages and Computing, (21), (2010), pp.23-32.