Abstract
Background
The COVID-19 pandemic and its associated public health mitigation strategies have dramatically changed patterns of daily life activities worldwide, resulting in unintentional consequences on behavioral risk factors, including smoking, alcohol consumption, poor nutrition, and physical inactivity. The infodemic of social media data may provide novel opportunities for evaluating changes related to behavioral risk factors during the pandemic.
Objective
We explored the feasibility of conducting a sentiment and emotion analysis using Twitter data to evaluate behavioral cancer risk factors (physical inactivity, poor nutrition, alcohol consumption, and smoking) over time during the first year of the COVID-19 pandemic.
Methods
Tweets during 2020 relating to the COVID-19 pandemic and the 4 cancer risk factors were extracted from the George Washington University Libraries Dataverse. Tweets were defined and filtered using keywords to create 4 data sets. We trained and tested a machine learning classifier using a prelabeled Twitter data set. This was applied to determine the sentiment (positive, negative, or neutral) of each tweet. A natural language processing package was used to identify the emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) based on the words contained in the tweets. Sentiments and emotions for each of the risk factors were evaluated over time and analyzed to identify keywords that emerged.
Results
The sentiment analysis revealed that 56.69% (51,479/90,813) of the tweets about physical activity were positive, 16.4% (14,893/90,813) were negative, and 26.91% (24,441/90,813) were neutral. Similar patterns were observed for nutrition, where 55.44% (27,939/50,396), 15.78% (7950/50,396), and 28.79% (14,507/50,396) of the tweets were positive, negative, and neutral, respectively. For alcohol, the proportions of positive, negative, and neutral tweets were 46.85% (34,897/74,484), 22.9% (17,056/74,484), and 30.25% (22,531/74,484), respectively, and for smoking, they were 41.2% (11,628/28,220), 24.23% (6839/28,220), and 34.56% (9753/28,220), respectively. The sentiments were relatively stable over time. The emotion analysis suggests that the most common emotion expressed across physical activity and nutrition tweets was trust (69,495/320,741, 21.67% and 42,324/176,564, 23.97%, respectively); for alcohol, it was joy (49,147/273,128, 17.99%); and for smoking, it was fear (23,066/110,256, 20.92%). The emotions expressed remained relatively constant over the observed period. An analysis of the most frequent words tweeted revealed further insights into common themes expressed in relation to some of the risk factors and possible sources of bias.
Conclusions
This analysis provided insight into behavioral cancer risk factors as expressed on Twitter during the first year of the COVID-19 pandemic. It was feasible to extract tweets relating to all 4 risk factors, and most tweets had a positive sentiment with varied emotions across the different data sets. Although these results can play a role in promoting public health, a deeper dive via qualitative analysis can be conducted to provide a contextual examination of each tweet.
Subject
Health Informatics,Medicine (miscellaneous)