Affiliation:
1. Guilin University of Electronic Technology
Abstract
Data extraction is an important problem in Deep Web data integration. To extract the query results returned by a Deep Web source, the target data block must first be located correctly. Because the HTML source code of a web page can be parsed into a well-structured DOM tree, we propose an effective algorithm for discerning the common path based on the hierarchical DOM. Using the common path together with our predefined regular expression, the target data of the Deep Web can be extracted effectively. Experimental results on real websites show that the proposed algorithm is highly effective.
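The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: the "common path" is approximated here by the most frequent root-to-node tag path among text nodes, and a hypothetical price pattern stands in for the predefined regular expression. The names `PathCollector`, `extract_records`, `SAMPLE_PAGE`, and `PRICE` are illustrative and do not come from the paper.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open tags from the root down to the current node
        self.records = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append(("/".join(self.stack), text))


def extract_records(html, pattern):
    """Return (common_path, texts on that path that match `pattern`).

    Assumption: the data block is the set of text nodes sharing the most
    frequent tag path. This is a stand-in heuristic, not the paper's method.
    """
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    if not collector.records:
        return None, []
    counts = Counter(path for path, _ in collector.records)
    common_path, _ = counts.most_common(1)[0]
    matches = [text for path, text in collector.records
               if path == common_path and pattern.search(text)]
    return common_path, matches


# Hypothetical result page: three records plus navigation noise.
SAMPLE_PAGE = """
<html><body>
  <div><h1>Search results</h1></div>
  <div><ul>
    <li>Book A - $10.50</li>
    <li>Book B - $7.00</li>
    <li>Book C - $12.25</li>
  </ul></div>
  <div><p>Contact us</p></div>
</body></html>
"""

# Hypothetical predefined expression: a dollar price with two decimals.
PRICE = re.compile(r"\$\d+\.\d{2}")

path, rows = extract_records(SAMPLE_PAGE, PRICE)
print(path)
print(rows)
```

In this sketch the repeated `li` records dominate the path counts, so the heading and footer text are excluded before the regular expression is applied. A real result page would likely need attribute-aware paths and a tie-breaking rule, which are left out here.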
Publisher
Trans Tech Publications, Ltd.
References: 10 articles.
1. Liu Wei, Meng Xiaofeng, Meng Weiyi. A Survey of Deep Web Data Integration. Chinese Journal of Computers, 2007, 30(9).
2. Chang K C, He B, Li C, Patel M, Zhang Z. Structured Databases on the Web: Observations and Implications. SIGMOD Record, 2004, 33(3): 61-70.
3. Jayant M, Jeffery S R, Cohen S, et al. Web-scale Data Integration: You Can Only Afford to Pay as You Go. Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research. Asilomar, USA, 2007: 342-350.
4. Lan Yi, Bing Liu, Xiaoli Li. Eliminating Noisy Information in Web Pages for Data Mining. KDD, 2003: 331-335.
5. Liu L, Pu C, Han W. XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources. In Proceedings of the 16th IEEE International Conference on Data Engineering, San Diego, California, 2000.
Cited by
2 articles.