Affiliation:
1. Leibniz Institute for the German Language (IDS)
Abstract
One of the fundamental questions about human language is whether all languages are equally complex. Here, we approach this question from an information-theoretic perspective. To this end, we conduct a large-scale quantitative cross-linguistic analysis of written language by training a language model on more than 6,500 different documents as represented in 41 multilingual text collections (corpora), consisting of ~3.5 billion words or ~9.0 billion characters and covering 2,069 different languages that are spoken as a native language by more than 90% of the world's population. We statistically infer the entropy of each language model as an index of (un)predictability/complexity. We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. This constitutes evidence against the equi-complexity hypothesis from an information-theoretic perspective, but it also unveils a complexity-efficiency trade-off: high-entropy languages are information-theoretically more efficient because they tend to need fewer symbols to encode messages. Our findings additionally contribute to debates about language evolution and diversity by showing that this trade-off is partly shaped by the social environment in which languages are used.
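The abstract does not specify which language model the authors train, so the following is only an illustrative sketch, assuming a toy character-bigram model: it estimates per-character entropy in bits and shows, in miniature, the complexity-efficiency trade-off described above (total information is roughly bits per character times characters per message, so a higher-entropy encoding can convey comparable content in fewer symbols). The function name and sample texts are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def char_bigram_entropy(text: str) -> float:
    """Estimate per-character entropy (bits/char) of `text` with a
    character-bigram model. Toy estimator for illustration only; the
    paper's abstract does not specify the model actually used."""
    if len(text) < 2:
        return 0.0
    bigrams = Counter(zip(text, text[1:]))     # counts of (prev, next) pairs
    contexts = Counter(text[:-1])              # counts of each preceding char
    total = sum(bigrams.values())
    entropy = 0.0
    for (prev, nxt), n in bigrams.items():
        p_pair = n / total                     # joint probability P(prev, next)
        p_cond = n / contexts[prev]            # conditional probability P(next | prev)
        entropy -= p_pair * math.log2(p_cond)  # H = -sum P(pair) * log2 P(next | prev)
    return entropy

# Hypothetical samples: a repetitive (predictable) string vs. a varied one.
samples = {
    "repetitive": "abababababababababab",
    "varied": "the quick brown fox jumps",
}
for name, text in samples.items():
    h = char_bigram_entropy(text)
    print(f"{name:>10}: {h:.2f} bits/char over {len(text)} chars "
          f"(~{h * len(text):.1f} bits total)")
```

Run as a script, this prints a low bits-per-character figure for the repetitive string and a higher one for the varied string, illustrating how entropy serves as an index of (un)predictability.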
Publisher
Research Square Platform LLC
Cited by
1 article.