Abstract
Advances in automated data processing, together with the unprecedented growth in user-generated social media (SM) content, have made public health surveillance (PHS) one of the long-standing SM applications. However, existing PHS systems that feed on SM data have not been widely deployed in national surveillance systems, which appears to stem from practitioners’ lack of trust in SM data. Providing more robust datasets over which machine learning (ML) models can be reliably trained and tested is a significant step toward overcoming this hurdle. The health implications of physical activity, sedentary behaviour, and sleep (PASS) are widely studied through traditional data sources, which are often out of date, costly to collect, and thus limited in quantity and coverage. We present LPHEADA, a multicountry and fully Labelled digital Public HEAlth DAtaset of tweets originating in Australia, Canada, the United Kingdom, and the United States between November 2018 and June 2020. LPHEADA contains 366,405 labels for 122,135 PASS-related tweets and provides details about the place, time, and demographics associated with each tweet. LPHEADA is publicly available and can be used to develop (un)supervised ML models for digital PASS surveillance.
Publisher
Cold Spring Harbor Laboratory