Language Statistics at Different Spatial, Temporal, and Grammatical Scales

Author:

Sánchez-Puig Fernanda (1, 2, 3), Lozano-Aranda Rogelio (1, 2), Pérez-Méndez Dante (2, 4), Colman Ewan (2, 5), Morales-Guzmán Alfredo J. (6), Rivera Torres Pedro Juan (2), Pineda Carlos (7), Gershenson Carlos (2, 8, 9, 10)

Affiliation:

1. Facultad de Ciencias, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico

2. Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico

3. Instituto de Física Interdisciplinar y Sistemas Complejos, Universidad de las Islas Baleares, 07122 Palma de Mallorca, Spain

4. Posgrado en Ciencias de la Computación, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico

5. Roslin Institute, University of Edinburgh, Midlothian EH8 9YL, UK

6. MIT Media Lab, Cambridge, MA 02139, USA

7. Instituto de Física, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico

8. School of Systems Science and Industrial Engineering, Binghamton University, Binghamton, NY 13902, USA

9. Instituto de Investigaciones en Matemáticas Aplicadas y Sistemas, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico

10. Santa Fe Institute, Santa Fe, NM 87501, USA

Abstract

In recent decades, the field of statistical linguistics has made significant strides, fueled by the availability of data. Leveraging Twitter data, this paper explores the English and Spanish languages, investigating their rank diversity across three scales: temporal intervals (from 3 to 96 h), spatial radii (from 3 km to over 3000 km), and grammatical word ngrams (from 1-grams to 5-grams). The analysis focuses on word ngrams, covering a period of 1 year (2014) and eight different countries. Our findings highlight the relevance of all three scales, with the most substantial changes observed at the grammatical level. Specifically, at the monogram level, rank diversity curves are remarkably similar across languages, countries, and temporal or spatial scales. However, as the grammatical scale expands, variations in rank diversity become more pronounced and are influenced by temporal, spatial, linguistic, and national factors. Additionally, we investigate the statistical characteristics of Twitter-specific tokens, including emojis, hashtags, and user mentions, revealing a sigmoid pattern in their rank diversity function. These insights contribute to quantifying universal language statistics while also identifying potential sources of variation.
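
The abstract does not define rank diversity. As a minimal sketch, assuming the definition commonly used in the rank-diversity literature (the number of distinct ngrams that occupy rank k across a set of time slices, divided by the number of slices), the quantity could be computed as follows. The function names `ranked_ngrams` and `rank_diversity`, the alphabetical tie-breaking rule, and the toy sentences are illustrative assumptions, not the authors' pipeline or data.

```python
from collections import Counter


def ranked_ngrams(tokens, n):
    """Return the n-grams of one time slice, ordered from most to least frequent."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    # Assumption: ties in frequency are broken alphabetically so the order is deterministic.
    return [g for g, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def rank_diversity(slices, n=1, max_rank=2):
    """For each rank k, count the distinct n-grams seen at that rank, normalized by the number of slices."""
    ranked = [ranked_ngrams(tokens, n) for tokens in slices]
    diversity = []
    for k in range(max_rank):
        # Collect whichever n-gram sits at rank k in each slice that is long enough.
        occupants = {r[k] for r in ranked if k < len(r)}
        diversity.append(len(occupants) / len(ranked))
    return diversity


# Toy data standing in for three time slices of tokenized text (hypothetical).
slices = [
    "the cat saw the cat and the dog".split(),
    "the dog and the other dog saw the cat".split(),
    "the bird and the bird saw the cat".split(),
]
d = rank_diversity(slices, n=1, max_rank=2)
print(d)
```

In this toy example the top rank is always occupied by "the", so its diversity is 1/3, while the second rank is occupied by a different word in every slice, giving a diversity of 1. Passing a larger `n` applies the same calculation to longer ngrams.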

Funder

UNAM-PAPIIT

CONACyT

Publisher

MDPI AG
