Machine Learning to Detect Self-Reporting of Symptoms, Testing Access, and Recovery Associated With COVID-19 on Twitter: Retrospective Big Data Infoveillance Study

Published: 2020-06-08
Volume: 6
Issue: 2
Page: e19509
ISSN: 2369-2960
Journal: JMIR Public Health and Surveillance
Journal abbreviation: JMIR Public Health Surveill
Language: en

Author:

Mackey Tim, Purushothaman Vidya, Li Jiawei, Shah Neal, Nali Matthew, Bardier Cortni, Liang Bryan, Cai Mingxiang, Cuomo Raphael

Abstract

Background: The coronavirus disease (COVID-19) pandemic is a global health emergency with over 6 million cases worldwide as of the beginning of June 2020. The pandemic is historic in scope and precedent given its emergence in an increasingly digital era. Importantly, there have been concerns about the accuracy of COVID-19 case counts due to issues such as lack of access to testing and difficulty in measuring recoveries.

Objective: The aims of this study were to detect and characterize user-generated conversations that could be associated with COVID-19-related symptoms, experiences with access to testing, and mentions of disease recovery using an unsupervised machine learning approach.

Methods: Tweets were collected from the Twitter public streaming application programming interface from March 3-20, 2020, filtered for general COVID-19-related keywords and then further filtered for terms that could be related to COVID-19 symptoms as self-reported by users. Tweets were analyzed using an unsupervised machine learning approach called the biterm topic model (BTM), where groups of tweets containing the same word-related themes were separated into topic clusters that included conversations about symptoms, testing, and recovery. Tweets in these clusters were then extracted and manually annotated for content analysis and assessed for their statistical and geographic characteristics.

Results: A total of 4,492,954 tweets were collected that contained terms that could be related to COVID-19 symptoms. After using BTM to identify relevant topic clusters and removing duplicate tweets, we identified a total of 3465 (<1%) tweets that included user-generated conversations about experiences that users associated with possible COVID-19 symptoms and other disease experiences. These tweets were grouped into five main categories including first- and secondhand reports of symptoms, symptom reporting concurrent with lack of testing, discussion of recovery, confirmation of negative COVID-19 diagnosis after receiving testing, and users recalling symptoms and questioning whether they might have been previously infected with COVID-19. The co-occurrence of tweets for these themes was statistically significant for users reporting symptoms with a lack of testing and with a discussion of recovery. A total of 63% (n=1112) of the geotagged tweets were located in the United States.

Conclusions: This study used unsupervised machine learning for the purposes of characterizing self-reporting of symptoms, experiences with testing, and mentions of recovery related to COVID-19. Many users reported symptoms they thought were related to COVID-19, but they were not able to get tested to confirm their concerns. In the absence of testing availability and confirmation, accurate case estimations for this period of the outbreak may never be known. Future studies should continue to explore the utility of infoveillance approaches to estimate COVID-19 disease severity.
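The Methods describe a two-stage keyword filter followed by a biterm topic model. The sketch below is illustrative only and is not the study's code: the keyword sets and the function names (`tokenize`, `passes_filters`, `biterms`, `count_biterms`) are hypothetical, and the study's actual search terms are not listed in this record. It shows the filtering step and the extraction of biterms (unordered word pairs co-occurring within one short text), which are the units a BTM models. Stop-word removal and the topic inference itself are omitted, and repeated words within a tweet are collapsed for simplicity.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists for illustration; not the study's actual terms.
COVID_TERMS = {"covid", "covid19", "coronavirus"}
SYMPTOM_TERMS = {"fever", "cough", "symptoms"}


def tokenize(text):
    """Lowercase a tweet and split it into alphanumeric tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def passes_filters(tokens):
    """Keep a tweet only if it has a general COVID-19 term and a symptom term."""
    words = set(tokens)
    return bool(words & COVID_TERMS) and bool(words & SYMPTOM_TERMS)


def biterms(tokens):
    """Return the unordered pairs of distinct words co-occurring in one tweet."""
    unique = sorted(set(tokens))
    return list(combinations(unique, 2))


def count_biterms(tweets):
    """Count biterms over all tweets that pass both keyword filters."""
    counts = Counter()
    for text in tweets:
        tokens = tokenize(text)
        if passes_filters(tokens):
            counts.update(biterms(tokens))
    return counts


if __name__ == "__main__":
    sample = [
        "I have a fever and cough, is it COVID?",
        "Coronavirus news today",
        "Bad cough this week",
    ]
    for pair, n in count_biterms(sample).most_common(3):
        print(pair, n)
```

In a full pipeline, the corpus-level biterm counts would feed a topic model whose clusters are then reviewed by hand, as the abstract describes for the symptom, testing, and recovery themes.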

Publisher

JMIR Publications Inc.

Subject

Public Health, Environmental and Occupational Health; Health Informatics

References: 32 articles.

1. Covid-19 fatality is likely overestimated

2. Correcting under-reported COVID-19 case numbers: estimating the true scale of the pandemic

3. COVID-19—New Insights on a Rapidly Changing Epidemic

4. Level of underreporting including underdiagnosis before the first peak of COVID-19 in various countries: Preliminary retrospective results based on wavelets and deterministic modeling

5. Howard J, Yu G. Most people recover from Covid-19. Here's why it's hard to pinpoint exactly how many. CNN. 2020-04-04. Accessed 2020-04-10. https://www.cnn.com/2020/04/04/health/recovery-coronavirus-tracking-data-explainer/index.html

Cited by 117 articles.

1. Dissecting the infodemic: An in-depth analysis of COVID-19 misinformation detection on X (formerly Twitter) utilizing machine learning and deep learning techniques;Heliyon;2024-09

2. COVIDHealth: A novel labeled dataset and machine learning-based web application for classifying COVID-19 discourses on Twitter;Heliyon;2024-07

3. An Online Tool for Monitoring and Understanding COVID-19 Based on Self-reporting Tweets and Large Language Models (Preprint);2024-06-12

4. A study of learning models for COVID-19 disease prediction;Journal of Ambient Intelligence and Humanized Computing;2024-03-28

5. Classification Performance Comparison of BERT and IndoBERT on SelfReport of COVID-19 Status on Social Media;Journal of Computer Sciences Institute;2024-03-20