Pattern Matching-based scraping of news websites

Published:2020-12-01 Issue:1 Volume:1694 Page:012011
ISSN:1742-6588
Container-title:Journal of Physics: Conference Series
Short-container-title:J. Phys.: Conf. Ser.

Author:

Salem Hamza, Mazzara Manuel

Abstract

Web scraping is the process of extracting content from human-readable websites in order to import it into local storage such as databases or CSV files. Designing the extraction process is time-consuming: it requires analyzing the website, the data representation of the objects comprising its structure (DOM), its HTML tags, and its Cascading Style Sheets (CSS) classes. We aim to automate this process. In this paper, we propose a pattern mining technique to scrape news and blog websites by recognizing the title and body based on a content structure pattern. The approach consists of three steps: extracting the news website structure, constructing a pattern of HTML content, and implementing the pattern as a set of rules in web scraping. It is a simple, general, and straightforward way to extract articles, each consisting of a title and a body, from any blog or news website.
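
The abstract does not state the content structure pattern itself, so the following is only a minimal sketch of what rule-based title and body recognition could look like. It assumes, purely for illustration, that the title is the first <h1> element and that the body is the set of <p> elements whose text exceeds a length threshold. The names ArticlePatternParser, extract_article, and min_paragraph_chars are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser


class ArticlePatternParser(HTMLParser):
    """Collects the text of <h1> and <p> elements (an assumed, simplified pattern)."""

    def __init__(self):
        super().__init__()
        self._current = None   # tag whose text is currently being collected
        self._buffer = []      # text fragments of the current element
        self.headings = []     # text of each <h1>, in document order
        self.paragraphs = []   # text of each <p>, in document order

    def handle_starttag(self, tag, attrs):
        # Start collecting only when not already inside an <h1> or <p>.
        if tag in ("h1", "p") and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text of inline children (e.g. <a>, <b>) is collected as well.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                target = self.headings if tag == "h1" else self.paragraphs
                target.append(text)
            self._current = None


def extract_article(html, min_paragraph_chars=40):
    """Apply the assumed rules: title = first <h1>, body = long <p> elements."""
    parser = ArticlePatternParser()
    parser.feed(html)
    parser.close()
    title = parser.headings[0] if parser.headings else None
    # Short paragraphs are treated as navigation or widget text and dropped.
    body = [p for p in parser.paragraphs if len(p) >= min_paragraph_chars]
    return {"title": title, "body": "\n\n".join(body)}
```

Calling extract_article(page_html) on a fetched page returns a dictionary with "title" and "body" keys. The length threshold is a crude stand-in for a mined pattern; a real rule set would be derived from the structure of the target sites, as the three steps in the abstract describe.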

Publisher

IOP Publishing

Subject

General Physics and Astronomy

Link

https://iopscience.iop.org/article/10.1088/1742-6596/1694/1/012011/pdf

References: 11 articles.

1. A web crawler design for data mining;Thelwall;Journal of Information Science,2001

Cited by 4 articles.

1. Multi Languages Pattern Matching-Based Scraping of News and Articles Websites;Advanced Information Networking and Applications;2023

2. Sentiment analysis using web scraping for live news data with machine learning algorithms;Materials Today: Proceedings;2022

3. Web Scraping Methods Used in Predicting Real Estate Prices;Advances in Computational Collective Intelligence;2021

4. Automatically Injecting Semantic Annotations into Online Articles;Advanced Information Networking and Applications;2021