Physical activity, sedentary behaviour, and sleep on Twitter: A multicountry and fully labelled dataset for public health surveillance research (Preprint)

Author:

Shakeri Hossein Abad ZahraORCID,Butler Gregory P.,Thompson Wendy,Lee JoonORCID

Abstract

BACKGROUND

Advances in automated data processing and machine learning (ML) models, together with the unprecedented growth in the number of social media users who publicly share and discuss health-related information, have made public health surveillance (PHS) one of the long-lasting social media applications. However, the existing PHS systems feeding on social media data have not been widely deployed in national surveillance systems, which appears to stem from the lack of practitioners and the public’s trust in social media data. More robust and reliable datasets over which supervised machine learning models can be trained and tested reliably is a significant step toward overcoming this hurdle.

OBJECTIVE

The health implications of daily behaviours (physical activity, sedentary behaviour, and sleep (PASS)), as an evergreen topic in PHS, are widely studied through traditional data sources such as surveillance surveys and administrative databases, which are often several months out of date by the time they are utilized, costly to collect, and thus limited in quantity and coverage. In this paper, we present LPHEADA, a multicountry and fully Labelled digital Public HEAlth DAtaset of tweets originated in Australia, Canada, the United Kingdom (UK), or the United States (US).

METHODS

We collected the data of this study from Twitter using the Twitter livestream application programming interface (API) between 28th November 2018 to 19th June 2020. To obtain PASS-related tweets for manual annotation, we iteratively used regular expressions, unsupervised natural language processing, domain-specific ontologies and linguistic analysis. We used Amazon Mechanical Turk (AMT) to label the collected data to self-reported PASS categories and implemented a quality control pipeline to monitor and manage the validity of crow-generated labels. Moreover, we used ML, latent semantic analysis, linguistic analysis, and label inference analysis to validate different components of the dataset.

RESULTS

LPHEADA contains 366,405 crowd-generated labels (three labels per tweet) for 122,135 PASS-related tweets, labelled by 708 unique annotators on AMT. In addition to crowd-generated labels, LPHEADA provides details about the three critical components of any PHS system: place, time, and demographics (gender, age range) associated with each tweet.

CONCLUSIONS

Publicly available datasets for digital PASS surveillance are usually isolated and only provide labels for small subsets of the data. We believe that the novelty and comprehensiveness of the dataset provided in this study will help develop, evaluate, and deploy digital PASS surveillance systems. LPHEADA will be an invaluable resource for both public health researchers and practitioners.

Publisher

JMIR Publications Inc.

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3