Affiliation:
1. Department of Information Technologies, Vilnius Gediminas Technical University, Sauletekio al. 11, LT-10223 Vilnius, Lithuania
Abstract
Web page segmentation is one of the most influential factors for the automated integration of web page content with other systems. Existing solutions focus on the segmentation itself but do not describe each segment in detail, including its range (the minimum and maximum HTML code bounds covering the segment content) and its variants (the same segment with different content). Therefore, this paper proposes a novel solution designed to find all web page content blocks and describe them in detail for further use. It applies text similarity and document object model (DOM) tree analysis methods to indicate the maximum and minimum ranges of each identified HTML block. In addition, it indicates each block's relation to other blocks, both hierarchical and sibling. The evaluation shows that the method identifies more content blocks than human labeling does (manual labeling covered only 24% of the blocks). Using the proposed method could reduce manual labeling effort by at least 70%. The method also performed better than the other analyzed web page segmentation methods: it achieved better recall because it processes every block present on a page, and it divides the web page into content blocks in more detail by reporting block boundary range and block variation data.
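The notions of block range and block variants used above can be illustrated with a small sketch. It is not the paper's algorithm; it only assumes one plausible reading of the terms: the minimum range is the deepest DOM element whose text contains the block content, the maximum range is the highest ancestor that adds no further text, and variants are sibling elements with a similar tag structure. The names `Node`, `block_range`, `variants`, and the 0.8 threshold are hypothetical, and the HTML is assumed to be well-formed.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM node: child nodes and text pieces kept in document order."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.items = tag, parent, []

    def text(self):
        parts = [i.text() if isinstance(i, Node) else i for i in self.items]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes well-formed HTML (illustration only)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        data = " ".join(data.split())  # normalize whitespace
        if data:
            self.current.items.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def iter_nodes(node):
    yield node
    for item in node.items:
        if isinstance(item, Node):
            yield from iter_nodes(item)


def depth(node):
    d = 0
    while node.parent is not None:
        node, d = node.parent, d + 1
    return d


def block_range(root, snippet):
    """Return (minimum, maximum) elements covering `snippet` (assumed definition)."""
    candidates = [n for n in iter_nodes(root) if snippet in n.text()]
    minimum = max(candidates, key=depth)  # deepest element containing the text
    maximum = minimum
    # Climb while the parent contributes no additional text (stop below the root).
    while (maximum.parent is not None and maximum.parent.parent is not None
           and maximum.parent.text() == maximum.text()):
        maximum = maximum.parent
    return minimum, maximum


def variants(node, threshold=0.8):
    """Siblings whose tag structure is similar to `node` (assumed definition)."""
    if node.parent is None:
        return []
    sig = [n.tag for n in iter_nodes(node)]
    return [s for s in node.parent.items
            if isinstance(s, Node) and s is not node
            and SequenceMatcher(None, sig, [n.tag for n in iter_nodes(s)]).ratio() >= threshold]


page = ("<body><div><ul>"
        "<li><div><a>First headline</a></div></li>"
        "<li><div><a>Second headline</a></div></li>"
        "</ul></div><p>Footer</p></body>")
minimum, maximum = block_range(parse(page), "First headline")
print(minimum.tag, maximum.tag, [v.text() for v in variants(maximum)])
```

In this toy page the `a` element is the tightest bound around the headline, while the enclosing `li` is the widest bound that still holds only that headline; the second `li` shares the same tag structure and is therefore reported as a variant.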
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science