Abstract
AbstractObjectiveCurrent studies leveraging social media data for disease monitoring face challenges like noisy colloquial language and insufficient tracking of user disease progression in longitudinal data settings. This study aims to develop a pipeline for collecting, cleaning, and analyzing large-scale longitudinal social media data for disease monitoring, with a focus on COVID-19 pandemic.Materials and MethodsThis pipeline initiates by screening COVID-19 cases from tweets spanning February 1, 2020, to April 30, 2022. Longitudinal data is collected for each patient, two months before and three months after self-reporting. Symptoms are extracted using Name Entity Recognition (NER), followed by denoising with a combination of Graph Convolutional Network (GCN) and Bidirectional Encoder Representations from Transformers (BERT) model to retain only User Symptom Mentions (USM). Subsequently, symptoms are mapped to standardized medical concepts using the Unified Medical Language System (UMLS). Finally, this study conducts symptom pattern analysis and visualization to illustrate temporal changes in symptom prevalence and co-occurrence.ResultsThis study identified 191,096 self-reported COVID-19-positive cases from COVID-19-related tweets and retrospectively collected 811,398,280 historical tweets, of which 2,120,964 contained symptoms information. After denoising, 39% (832,287) of symptom-sharing tweets reflected user-related mentions. The trained USM model achieved an F1 score of 0.926. Further analysis revealed a higher prevalence of upper respiratory tract symptoms during the Omicron period compared to the Delta and wild-type periods. Additionally, there was a pronounced co-occurrence of lower respiratory tract and nervous system symptoms in the wild-type strain and Delta variant.ConclusionThis study established a robust framework for pandemic monitoring via social media, integrating denoising of user-related symptom mentions and longitudinal data. The findings underscore the importance of denoising procedures in revealing accurate prevalence trends, thereby minimizing biases in symptom analysis.
Publisher
Cold Spring Harbor Laboratory