Abstract
This chapter discusses the compilation and management of Twitter/X corpora. Twitter corpora differ from other types of corpora in many respects: they consist of a very large number of very small documents (tweets), each carrying extensive metadata, which are downloaded through scripts that use the available APIs. This calls for specific tools and techniques. The language used in social media also differs markedly from more standard genres in both form and content. Combined with the large size of these corpora, this makes effective sampling techniques necessary, and these are discussed at length in this chapter. Finally, the chapter describes the use of geotagged data and the creation and management of subcorpora.
Publisher
Springer Nature Switzerland