Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set

Published: 2021-01-22 Issue: 1 Volume: 23 Page: e25314
ISSN: 1438-8871
Container-title: Journal of Medical Internet Research
Language: en
Short-container-title: J Med Internet Res

Author:

Klein Ari Z, Magge Arjun, O'Connor Karen, Flores Amaro Jesus Ivan, Weissenbacher Davy, Gonzalez Hernandez Graciela

Abstract

Background: In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone.

Objective: The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention.

Methods: Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020.

Results: Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state–level geolocations.

Conclusions: We have made the 13,714 tweets identified in this study, along with each tweet’s time stamp and US state–level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
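
The rule-based steps described in the Methods (regular-expression matching followed by removal of reported speech) can be illustrated with a minimal Python sketch. The abstract does not give the study's handwritten regular expressions or its reported-speech filter, so the patterns `EXPOSURE_PATTERNS` and `REPORTED_SPEECH` and the function `is_candidate` below are hypothetical stand-ins chosen only to show the shape of such a pre-filter.

```python
import re

# ASSUMPTION: illustrative patterns only. The study's actual handwritten
# regular expressions are not listed in the abstract.
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi (?:think|might|may) (?:i )?(?:have|had) (?:the )?(?:covid|coronavirus)",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bi (?:was|have been) exposed to (?:covid|coronavirus)",
        re.IGNORECASE,
    ),
]

# ASSUMPTION: a crude heuristic for "reported speech": retweets, tweets that
# open with a quotation mark, and tweets that carry a link (often headlines).
REPORTED_SPEECH = re.compile(r'^\s*(?:rt @|["“])|https?://', re.IGNORECASE)


def is_candidate(tweet: str) -> bool:
    """Return True if the tweet matches an exposure pattern and does not
    look like reported speech under the heuristics above."""
    if not any(p.search(tweet) for p in EXPOSURE_PATTERNS):
        return False
    return REPORTED_SPEECH.search(tweet) is None


if __name__ == "__main__":
    samples = [
        "I think I have covid, been coughing all week",
        '"I think I have covid," said the actor',
        "Stay home and wash your hands",
    ]
    for text in samples:
        print(is_candidate(text), "|", text)
```

In the pipeline the abstract describes, tweets that pass a filter of this kind are not yet treated as self-reports; they are passed on to manual annotation and then to the BERT-based classifier.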

Publisher

JMIR Publications Inc.

Subject

Health Informatics

References: 19 articles.

1. Real-time tracking of self-reported symptoms to predict potential COVID-19

2. Smith A, Anderson M. Social media use in 2018. Pew Research Center. 2018-03-01. Accessed 2020-09-29. https://www.pewresearch.org/internet/2018/03/01/social-media-use-in-2018/

3. Self-reported COVID-19 symptoms on Twitter: an analysis and a research resource

4. Identification of Risk Factors and Symptoms of COVID-19: Analysis of Biomedical Literature and Social Media Data

5. Machine Learning to Detect Self-Reporting of Symptoms, Testing Access, and Recovery Associated With COVID-19 on Twitter: Retrospective Big Data Infoveillance Study

Cited by 47 articles.

1. Leveraging social computing for epidemic surveillance: A case study;Big Data Research;2024-11

2. CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts;Knowledge-Based Systems;2024-07

3. Denoising Longitudinal Social Media for Pandemic Monitoring;2024-06-30

4. An Online Tool for Monitoring and Understanding COVID-19 Based on Self-reporting Tweets and Large Language Models (Preprint);2024-06-12

5. Classification Performance Comparison of BERT and IndoBERT on SelfReport of COVID-19 Status on Social Media;Journal of Computer Sciences Institute;2024-03-20