Deep Web Data Extraction Based on Regular Expression

Published: 2013-07 Volume: 718-720 Page: 2242-2247
ISSN: 1662-8985
Container-title: Advanced Materials Research
Short-container-title: AMR

Author:

Lin Tao¹, Qiang Bao Hua¹, Long Shi¹, Qian He¹

Affiliation:

1. Guilin University of Electronic Technology

Abstract

Data extraction is an important issue in Deep Web data integration. To extract the query results of the Deep Web, the target data block must first be located correctly. Because the HTML source code of web pages can be parsed into a well-structured DOM, we propose an effective algorithm for discerning the common path based on the hierarchical DOM. Based on the common path and our predefined regular expressions, the target data of the Deep Web can be extracted effectively. Experimental results on real websites show that the proposed algorithm is highly effective.
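The abstract names two steps, discerning a common path in the DOM and applying a predefined regular expression, but this record does not give the algorithm or the expressions themselves. The following is only a minimal sketch of the general idea under stated assumptions: the "common path" is approximated as the tag path shared by the most text nodes, and the names `PathCollector`, `common_path`, `extract`, `SAMPLE`, and `PATTERN` are hypothetical, not taken from the paper.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record (tag path, text) for every non-empty text node."""

    # Void elements never get a closing tag, so they are not pushed.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.nodes = []   # list of (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def common_path(nodes):
    """Assumption: the target block is the tag path shared by most text nodes."""
    counts = Counter(path for path, _ in nodes)
    return counts.most_common(1)[0][0]


def extract(html, pattern):
    """Locate the common path, then apply the regex to text found on it."""
    collector = PathCollector()
    collector.feed(html)
    path = common_path(collector.nodes)
    regex = re.compile(pattern)
    records = []
    for node_path, text in collector.nodes:
        if node_path == path:
            match = regex.search(text)
            if match:
                records.append(match.groupdict())
    return path, records


# Hypothetical result page: three repeated records plus navigation noise.
SAMPLE = """
<html><body>
<div id="nav"><a>Home</a></div>
<ul>
  <li>Learning Python - $39.99</li>
  <li>Deep Web Mining - $54.50</li>
  <li>DOM in Depth - $25.00</li>
</ul>
<p>Copyright</p>
</body></html>
"""

# Hypothetical predefined expression for "title - $price" records.
PATTERN = r"(?P<title>.+?)\s+-\s+\$(?P<price>\d+\.\d{2})"

path, records = extract(SAMPLE, PATTERN)
print(path)
for record in records:
    print(record)
```

In this sketch the navigation link and the footer each sit on a path that occurs once, so the repeated `li` path wins the frequency count and only its text is passed to the regular expression. A real result page would likely need a more careful notion of path similarity than exact string equality.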

Publisher

Trans Tech Publications, Ltd.

Subject

General Engineering

Link

https://www.scientific.net/AMR.718-720.2242.pdf

References: 10 articles.

1. Liu Wei, Meng Xiaofeng, Meng Weiyi. A Survey of Deep Web Data Integration. Chinese Journal of Computers, Vol. 30, No. 9, (2007).

2. Chang K C, He B, Li C, Patel M, Zhang Z. Structured Databases on the Web: Observations and Implications. SIGMOD Record, 2004, 33(3): 61-70.

3. Jayant M, Jeffery S R, Cohen S, et al. Web-scale Data Integration: You Can Only Afford to Pay as You Go. Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research. Asilomar, USA, 2007: 342-350.

4. Lan Yi, Bing Liu, Xiaoli Li. Eliminating Noisy Information in Web Pages for Data Mining. KDD, 2003: 331-335.

5. Liu L, Pu C, Han W. XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources. In Proceedings of the 16th IEEE International Conference on Data Engineering, San Diego, California, (2000).

Cited by 2 articles.

1. A framework enhancement method of deep web data extraction; Materials Today: Proceedings; 2021-02

2. LTDE: A Layout Tree Based Approach for Deep Page Data Extraction; IEICE Transactions on Information and Systems; 2017