Author:
Alizadeh Meysam, Zare Darya, Samei Zeynab, Alizadeh Mohammadamin, Kubli Mael, Aliahmadi Mohammadhadi, Ebrahimi Sarvenaz, Gilardi Fabrizio
Abstract
Twitter data has been widely used by researchers across various social and computer science disciplines. A common aim when working with Twitter data is the construction of a random sample of users from a given country. However, while several methods have been proposed in the literature, their comparative performance is mostly unexplored. In this paper, we implement four common methods to create a random sample of Twitter users in the US: 1% Stream, Bounding Box, Location Query, and Language Query. Then, we compare these methods according to their tweet- and user-level metrics as well as their accuracy in estimating the US population. Our results show that, compared to the other three methods, users collected by the 1% Stream method tend to have more tweets, tweets per day, followers, and friends; fewer likes; and younger accounts; and they include more male users. Moreover, the 1% Stream method achieves the minimum error in estimating the US population. However, it is time-consuming, cannot be used for past time frames, and is not suitable when user engagement is part of the study. In situations where these three drawbacks are important, our results support the Bounding Box method as the second-best method.
Funder
HORIZON EUROPE European Research Council
University of Zurich
Publisher
Springer Science and Business Media LLC