Affiliation:
1. School of Computer Science, University of Windsor, Windsor, ON, Canada
Abstract
Extracting comparative, heterogeneous web content data, both derived and historical, from related web pages is still in its infancy. Discovering potentially useful and previously unknown knowledge from web content, for example answering a query such as “list all articles on 'Sequential Pattern Mining' written between 2007 and 2011, including title, authors, volume, abstract, paper, citation, and year of publication,” requires finding the schemas of web documents from different web pages, integrating the web content data, and building a virtual or physical data warehouse before content can be extracted and mined from the database. This paper proposes the WebOMiner system, a technique for automatic web content data extraction that models web sites of a specific domain, such as Business to Customer (B2C) web sites, as object-oriented database schemas. Non-deterministic finite state automata (NFA) based wrappers that recognize content types from this domain are then built and used to extract related contents from data blocks into an integrated database for future second-level mining and deep knowledge discovery.
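The idea of an NFA-based wrapper can be illustrated with a minimal sketch. It assumes a data block has already been reduced to a sequence of content-type tokens. The token names (`image`, `title`, `text`, `price`), the state names, and the `accepts` function are hypothetical and chosen only for illustration; the abstract does not specify WebOMiner's actual automata.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Transition table: (state, token) -> set of possible next states.
# All states and tokens here are illustrative assumptions, not the paper's.
Transitions = Dict[Tuple[str, str], Set[str]]

PRODUCT_NFA: Transitions = {
    # Non-deterministic choice: after an image, either expect more images
    # (stay in "start") or move on to the title.
    ("start", "image"): {"start", "has_image"},
    ("start", "title"): {"has_title"},        # images are optional
    ("has_image", "title"): {"has_title"},
    ("has_title", "text"): {"has_title"},     # any number of description lines
    ("has_title", "price"): {"accept"},
}


def accepts(
    tokens: Iterable[str],
    transitions: Transitions = PRODUCT_NFA,
    start: str = "start",
    accepting: FrozenSet[str] = frozenset({"accept"}),
) -> bool:
    """Simulate the NFA by tracking the set of states reachable so far."""
    current: Set[str] = {start}
    for token in tokens:
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, token), set())
        }
        if not current:  # no state can consume this token
            return False
    return bool(current & accepting)


# A block is treated as a product record only if the NFA accepts it.
print(accepts(["image", "image", "title", "text", "price"]))
print(accepts(["price", "title"]))
```

In a pipeline of the kind the abstract describes, blocks accepted by such a wrapper would be the ones written to the integrated database; that storage step is omitted here.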
Subject
Hardware and Architecture, Software
Cited by
1 article.
1. Comparative Mining of B2C Web Sites by Discovering Web Database Schemas; Proceedings of the 20th International Database Engineering & Applications Symposium (IDEAS '16); 2016