Strategies for the Analysis of Large Social Media Corpora: Sampling and Keyword Extraction Methods

Published:2023-04-30 Issue:3 Volume:7 Page:241-265
ISSN:2509-9507
Container-title:Corpus Pragmatics
language:en
Short-container-title:Corpus Pragmatics

Author:

Moreno-Ortiz Antonio, García-Gámez María

Abstract

In the context of the COVID-19 pandemic, social media platforms such as Twitter have been of great importance for users to exchange news, ideas, and perceptions. Researchers from fields such as discourse analysis and the social sciences have turned to this content to explore public opinion and stance on this topic, and they have tried to gather information by compiling large-scale corpora. However, the size of such corpora is both an advantage and a drawback, as simple text retrieval techniques and tools may prove impractical or altogether incapable of handling such masses of data. This study provides methodological and practical cues on how to manage the contents of a large-scale social media corpus such as the COVID-19 corpus of Chen et al. (JMIR Public Health Surveill 6(2):e19273, 2020). We compare and evaluate, in terms of efficiency and efficacy, available methods to handle such a large corpus. First, we compare different sample sizes to assess whether it is possible to achieve similar results despite the size difference, and we evaluate sampling methods following a specific data management approach to storing the original corpus. Second, we examine two keyword extraction methodologies commonly used to obtain a compact representation of the main subject and topics of a text: the traditional method used in corpus linguistics, which compares word frequencies against a reference corpus, and graph-based techniques as developed for Natural Language Processing tasks. The methods and strategies discussed in this study enable valuable quantitative and qualitative analyses of an otherwise intractable mass of social media data.
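
To make the two keyword extraction families named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the toy token lists, and the parameter values are all hypothetical. It assumes the text is already tokenized, uses log-likelihood as one common keyness statistic for the reference-corpus approach, and uses PageRank over a word co-occurrence graph (in the spirit of TextRank) for the graph-based approach. The paper itself may rely on different statistics or tools.

```python
import math
from collections import Counter, defaultdict


def log_likelihood_keyness(target_tokens, reference_tokens):
    """Score target words by log-likelihood against a reference corpus."""
    tgt, ref = Counter(target_tokens), Counter(reference_tokens)
    n_tgt, n_ref = len(target_tokens), len(reference_tokens)
    scores = {}
    for word, a in tgt.items():
        b = ref.get(word, 0)
        # Expected counts if the word were equally frequent in both corpora.
        e_tgt = n_tgt * (a + b) / (n_tgt + n_ref)
        e_ref = n_ref * (a + b) / (n_tgt + n_ref)
        ll = 2 * (a * math.log(a / e_tgt) + (b * math.log(b / e_ref) if b else 0.0))
        # Keep only words over-represented in the target corpus.
        if a / n_tgt > b / n_ref:
            scores[word] = ll
    return scores


def graph_keywords(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it within the window.
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    n = len(neighbours)
    rank = {w: 1.0 / n for w in neighbours}
    for _ in range(iterations):
        rank = {
            w: (1 - damping) / n
            + damping * sum(rank[v] / len(neighbours[v]) for v in neighbours[w])
            for w in neighbours
        }
    return rank


# Toy data (hypothetical), standing in for a tweet sample and a general corpus.
target = "covid vaccine covid mask lockdown vaccine covid".split()
reference = "the weather is nice the vaccine is new the mask".split()

keyness = log_likelihood_keyness(target, reference)
ranks = graph_keywords(target)
top_keyword = max(keyness, key=keyness.get)
```

The keyness function needs a reference corpus and rewards words that are unusually frequent in the target, whereas the graph function needs only the target text and rewards words that are well connected to other words. On a real corpus both would normally be preceded by stopword filtering and normalization, which this sketch omits.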

Funder

Ministerio de Ciencia, Innovación y Universidades

Consejería de Economía, Conocimiento, Empresas y Universidad, Junta de Andalucía

Ministerio de Educación y Formación Profesional

Universidad de Málaga

Publisher

Springer Science and Business Media LLC

Subject

Computer Science Applications, Linguistics and Language, Language and Linguistics

Link

https://link.springer.com/content/pdf/10.1007/s41701-023-00143-0.pdf

References: 38 articles.

1. Aiello, L. M., Quercia, D., Zhou, K., Constantinides, M., Šćepanović, S., & Joglekar, S. (2021). How epidemic psychology works on Twitter: Evolution of responses to the COVID-19 pandemic in the U.S. Humanities and Social Sciences Communications, 8(1), 179. https://doi.org/10.1057/s41599-021-00861-3

2. Anthony, L. (2022). AntConc (Version 4.0.10). Waseda University. https://www.laurenceanthony.net/software.

3. Bahja, M., & Safdar, G. A. (2020). Unlink the link between COVID-19 and 5G networks: An NLP and SNA based approach. IEEE Access, 8, 209127–209137. https://doi.org/10.1109/ACCESS.2020.3039168

4. Banda, J. M., Tekumalla, R., Wang, G., Yu, J., Liu, T., Ding, Y., Artemova, K., Tutubalina, E., & Chowell, G. (2020). A large-scale COVID-19 Twitter chatter dataset for open scientific research—An international collaboration (Version 30). Zenodo. https://doi.org/10.5281/ZENODO.4065674

5. Beliga, S., Meštrović, A., & Martinčić-Ipšić, S. (2015). An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences, 39(1), 1–20.

Cited by 3 articles.

1. Exploring the Potential Impact of GLP-1 Receptor Agonists on Substance Use, Compulsive Behavior, and Libido: Insights from Social Media Using a Mixed-Methods Approach;Brain Sciences;2024-06-20

2. WordPPR: A Researcher-Driven Computational Keyword Selection Method for Text Data Retrieval from Digital Media;Communication Methods and Measures;2023-11-14

3. GLP-1 Receptor Agonists and Related Mental Health Issues; Insights from a Range of Social Media Platforms Using a Mixed-Methods Approach;Brain Sciences;2023-10-24