Abstract
This paper presents an adaptive approach to web information extraction. Most traditional web information extraction approaches depend on the templates of the target websites, so when a template changes, the extraction rules must be redesigned. To reduce maintenance costs and improve the adaptability of information extractors, an adaptive approach based on the STU-DOM tree is proposed. Each webpage is first parsed into a DOM tree using HTML Parser. The DOM tree is then filtered into an STU-DOM tree to identify the blocks that contain keywords of a given topic. When the approach is applied to webpages, the results show that it extracts information efficiently and does not depend on site structure.
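The parse-then-filter pipeline described above can be illustrated with a minimal sketch. The abstract does not define how an STU-DOM tree is constructed, so the code below is only an assumption about the general idea: it builds a simple DOM-like tree and prunes every subtree whose text contains none of the topic keywords. Python's standard `html.parser` module stands in for the HTML Parser library named in the abstract, and the names `Node`, `TreeBuilder`, `prune`, and `extract_blocks` are hypothetical, not taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling, not page content.
SKIP_TAGS = {"script", "style"}


class Node:
    """A hypothetical, minimal DOM-like node (not the paper's data structure)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text fragments that sit directly under this node


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS and data.strip():
            self.current.text.append(data.strip())


def prune(node, keywords):
    """Drop child subtrees without keywords; return True if this subtree survives."""
    node.children = [c for c in node.children if prune(c, keywords)]
    own_text = " ".join(node.text).lower()
    return bool(node.children) or any(k in own_text for k in keywords)


def leaves(node):
    """Collect the deepest surviving nodes of the pruned tree."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def extract_blocks(html, keywords):
    """Return the text of the deepest blocks that mention any topic keyword."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    if not prune(builder.root, [k.lower() for k in keywords]):
        return []
    return [" ".join(n.text) for n in leaves(builder.root)]
```

Because the pruning relies only on keyword matches in the text, and not on tag paths or class names, the sketch makes no use of any site-specific template, which is the property the abstract emphasizes. A call such as `extract_blocks(page_html, ["weather", "rain"])` would return the text of the blocks that survive the filter.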
Publisher
Trans Tech Publications, Ltd.