Affiliation:
1. Max Planck Institute for Psycholinguistics
Abstract
Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from the word level to the subword level, analyzing how they balance tokens and types to enhance model adaptability while controlling complexity. Although subword tokenizers such as Byte Pair Encoding (BPE) overcome many limitations of word tokenizers, they encounter difficulties in handling non-Latin languages and depend heavily on extensive training data and computational resources to grasp the nuances of multiword expressions (MWEs). This article argues that tokenizers are more than mere technical tools and should draw inspiration from cognitive science about human language processing. The study then introduces the “Principle of Least Effort” from cognitive science, which holds that humans naturally seek to reduce cognitive effort, and discusses the benefits of this principle for tokenizer development. Based on this principle, the paper proposes that the Less-is-Better (LiB) model could be a new approach for LLM tokenizers. The LiB model can autonomously learn an integrated vocabulary consisting of subwords, words, and MWEs, which effectively reduces both the number of tokens and the number of types. Comparative evaluations show that the LiB tokenizer outperforms existing word and BPE tokenizers, presenting an innovative method for tokenizer development and hinting that future cognitive science-based tokenizers may be more efficient.
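
To illustrate the token/type trade-off discussed in the abstract, the Python sketch below performs a few BPE-style pair merges on a toy character sequence and reports how the token count falls while the type (vocabulary) count grows. It is a minimal illustration only: the function, corpus, and merge count are invented for this example and do not correspond to the paper's LiB model or any published implementation.

    # Toy illustration of the tokens-vs-types trade-off (not the LiB model).
    from collections import Counter

    def merge_most_frequent_pair(tokens):
        """Merge the most frequent adjacent token pair into one new token."""
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            return tokens
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)   # new, longer type enters the vocabulary
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    corpus = list("in the end the end is the end")  # start from characters
    for _ in range(10):                             # a few BPE-style merges
        corpus = merge_most_frequent_pair(corpus)

    # Fewer tokens in the sequence, but more types in the vocabulary.
    print(len(corpus), len(set(corpus)))

Each merge shortens the token sequence (fewer tokens) at the cost of a larger vocabulary (more types), which is the balance that word-level, subword-level, and LiB-style tokenizers negotiate in different ways.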
Publisher
John Benjamins Publishing Company
Cited by
2 articles.
1. The origin and function of external representations. Adaptive Behavior, 2024-06-21.
2. Introduction. International Journal of Chinese Linguistics, 2024-06-17.