Abstract
Background
The internet has become an increasingly important resource for health information. However, given the growing number of web pages, it is nearly impossible for humans to manually keep track of the continuously changing content in the health domain. To better understand the nature of web-based health information available in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis addressing (1) to (3).
Objective
This study demonstrates the suitability of a focused crawler for acquiring the German Health Web (GHW), which comprises the health-related web content of the three predominantly German-speaking countries: Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure, covering its size, its most important content providers, and the ratio of public to private stakeholders. In addition, we report our experiences in building and operating such a highly scalable crawler.
Methods
A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, precision, and recall on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted over 227 days. The crawler was evaluated using its harvest rate, and its recall was estimated using a seed-target approach.
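For illustration, the following is a minimal sketch of how such a classifier could be trained and evaluated, assuming scikit-learn with TF-IDF features and a linear SVM; the actual feature extraction, hyperparameters, and corpus loading are not specified in the abstract, and load_labeled_pages is a hypothetical helper. The two small functions at the end show how the harvest rate and the seed-target recall are typically computed.

```python
# Sketch only: feature setup and hyperparameters are assumptions, not the
# study's actual configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score


def load_labeled_pages():
    """Hypothetical helper: returns (texts, labels), where label 1 marks
    health-related pages and 0 marks non-health-related pages."""
    raise NotImplementedError


texts, labels = load_labeled_pages()

# 80/20 training/test split (corresponds to TD1 in the abstract).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

vectorizer = TfidfVectorizer(max_features=100_000)
clf = LinearSVC()
clf.fit(vectorizer.fit_transform(X_train), y_train)

y_pred = clf.predict(vectorizer.transform(X_test))
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))


def harvest_rate(relevant_pages: int, fetched_pages: int) -> float:
    """Fraction of fetched pages classified as relevant."""
    return relevant_pages / fetched_pages


def seed_target_recall(targets_found: int, targets_total: int) -> float:
    """Share of known target hosts reached by the crawl (seed-target approach)."""
    return targets_found / targets_total
```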
Results
In total, n=22,405 seed URLs were collected from Curlie and a previous crawl, with country-code top-level domains .de: 85.36% (19,126/22,405), .at: 6.83% (1530/22,405), and .ch: 7.81% (1749/22,405). The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954), and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were websites published by public institutions, 25% (19/75) by nonprofit organizations, and 35% (26/75) by private organizations or individuals.
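To illustrate how such host-level statistics and PageRank rankings can be derived, the following is a minimal sketch assuming the host-aggregated link graph is available as a plain edge list and analyzed with networkx; the graph tooling actually used in the study is not specified in the abstract, and the file name, edge-list format, and damping factor are assumptions.

```python
# Sketch only: reads a hypothetical edge list of host-to-host links and
# reports basic graph statistics plus the top-ranked hosts by PageRank.
import networkx as nx

# Each line of the (hypothetical) edge list: "source_host target_host"
G = nx.read_edgelist("ghw_host_graph.txt", create_using=nx.DiGraph())

n = G.number_of_nodes()
print("nodes:", n)
print("edges:", G.number_of_edges())
print("average in-degree :", sum(d for _, d in G.in_degree()) / n)
print("average out-degree:", sum(d for _, d in G.out_degree()) / n)

# PageRank gives the prestige scores used to rank content providers;
# alpha=0.85 is the common default, not necessarily the study's setting.
pagerank = nx.pagerank(G, alpha=0.85)
top_hosts = sorted(pagerank.items(), key=lambda kv: kv[1], reverse=True)[:25]
for host, score in top_hosts:
    print(f"{host}\t{score:.6f}")
```

Per-country rankings, as reported above, could be obtained by filtering hosts on their country-code top-level domain (.de, .at, .ch) before selecting the top 25.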
Conclusions
The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As intended, the computed statistical data allow for determining major information hubs and important content providers on the GHW. In the future, the acquired data may be used not only to assess important topics and trends but also to build health-specific search engines.