Web Page Content Block Identification with Extended Block Properties

Author:

Griazev Kiril 1, Ramanauskaitė Simona 1

Affiliation:

1. Department of Information Technologies, Vilnius Gediminas Technical University, Sauletekio al. 11, LT-10223 Vilnius, Lithuania

Abstract

Web page segmentation is one of the most influential factors for the automated integration of web page content with other systems. Existing solutions focus on segmentation itself but do not describe each segment in detail, including its range (the minimum and maximum HTML code bounds that cover the segment content) and its variants (the same segment with different content). Therefore, this paper proposes a novel solution designed to find all web page content blocks and describe them in detail for further use. It applies text similarity and document object model (DOM) tree analysis methods to determine the maximum and minimum range of each identified HTML block. In addition, it indicates each block's relation to other blocks, both hierarchical and sibling. The evaluation shows that the method identifies more content blocks than human labeling does (only 24% of blocks were labeled manually). Using the proposed method could reduce manual labeling effort by at least 70%. The method performed better than the other analyzed web page segmentation methods: it achieved better recall because it processes every block present on a page, and it provides a more detailed division of the web page into content blocks by presenting block boundary range and block variation data.
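
The notions of block range and block variants can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the maximum range of a block is the outermost DOM element whose visible text equals the block text, that the minimum range is the innermost such element, and that sibling variants are siblings with near-identical tag sequences. The similarity measure (`difflib.SequenceMatcher` over tag names), the 0.8 threshold, the helper names, and the sample page are all hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Hypothetical, well-formed sample page (real pages would need an HTML parser).
PAGE = """
<html><body>
  <div id="wrap"><div id="news"><h2>Latest news</h2><p>Story one</p></div></div>
  <ul>
    <li><a href="/a">Alpha</a></li>
    <li><a href="/b">Beta</a></li>
  </ul>
</body></html>
""".strip()


def text_of(el):
    """Whitespace-normalised text of an element and all its descendants."""
    return " ".join(" ".join(el.itertext()).split())


def block_range(root, block_text):
    """Return (outermost, innermost) elements whose text equals block_text.

    Assumption: the block text occurs once, so all matches lie on a single
    ancestor chain and document order runs from outermost to innermost.
    """
    matches = [el for el in root.iter() if text_of(el) == block_text]
    if not matches:
        return None
    return matches[0], matches[-1]


def tag_sequence(el):
    """Tag names of the element's subtree in document order."""
    return [node.tag for node in el.iter()]


def sibling_variants(parent, threshold=0.8):
    """Pairs of child elements whose tag sequences are similar enough.

    Assumption: structural similarity stands in for the similarity
    measure used in the paper, which the abstract does not specify.
    """
    children = list(parent)
    pairs = []
    for i, first in enumerate(children):
        for second in children[i + 1:]:
            ratio = SequenceMatcher(
                None, tag_sequence(first), tag_sequence(second)
            ).ratio()
            if ratio >= threshold:
                pairs.append((first, second))
    return pairs


root = ET.fromstring(PAGE)
outer, inner = block_range(root, "Latest news Story one")
variants = sibling_variants(root.find("./body/ul"))
print(outer.get("id"), inner.get("id"), len(variants))
```

In this sample the two nested `div` elements carry the same text, so they bound the same block from outside and inside, while the two `li` elements share a structure and differ only in content.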

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science
