Author:
Su Mu-Chun, Wang Shao-Jui, Huang Chen-Ko, Pa-Chun Wang, Hsu Fu-Hau, Lin Shih-Chieh, Hsieh Yi-Zeng
Abstract
Most of the rapidly growing volume of information on the World Wide Web is delivered as HTML and formatted for human browsing rather than for software programs. This calls for a tool that automatically extracts information from semistructured Web information sources, which would increase the usefulness of value-added Web services. We present a SIgnal-Representation-bAsed Parser (SIRAP) that breaks Web pages up into logically coherent groups, for example groups of information related to a single entity. Templates for records with different tag structures are generated incrementally by a Histogram-Based Correlation Coefficient (HBCC) algorithm, and records on a Web page are then detected efficiently by matching against the generated templates. Hundreds of Web pages from 17 state-of-the-art search engines were used to demonstrate the feasibility of our approach.
Publisher
Fuji Technology Press Ltd.
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction