Affiliation:
1. Max Planck Institute for Psycholinguistics
Abstract
Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from the word level to the subword level, analyzing how they balance tokens and types to enhance model adaptability while controlling complexity. Although subword tokenizers such as Byte Pair Encoding (BPE) overcome many limitations of word tokenizers, they encounter difficulties in handling non-Latin languages and depend heavily on extensive training data and computational resources to grasp the nuances of multiword expressions (MWEs). This article argues that tokenizers are more than mere technical tools and should draw inspiration from cognitive science about human language processing. The study then introduces the “Principle of Least Effort” from cognitive science, which holds that humans naturally seek to reduce cognitive effort, and discusses the benefits of this principle for tokenizer development. Based on this principle, the paper proposes that the Less-is-Better (LiB) model could be a new approach for LLM tokenizers. The LiB model can autonomously learn an integrated vocabulary consisting of subwords, words, and MWEs, which effectively reduces both the number of tokens and the number of types. Comparative evaluations show that the LiB tokenizer outperforms existing word and BPE tokenizers, presenting an innovative method for tokenizer development and hinting that future cognitive science-based tokenizers may be more efficient.
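
To illustrate the token/type trade-off discussed in the abstract, the Python sketch below performs a few BPE-style pair merges on a toy character sequence and reports how the token count falls while the type (vocabulary) count grows. It is a minimal illustration only: the function, corpus, and merge count are invented for this example and do not correspond to the paper's LiB model or any published implementation.

    # Toy illustration of the tokens-vs-types trade-off (not the LiB model).
    from collections import Counter

    def merge_most_frequent_pair(tokens):
        """Merge the most frequent adjacent token pair into one new token."""
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            return tokens
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)   # new, longer type enters the vocabulary
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    corpus = list("in the end the end is the end")  # start from characters
    for _ in range(10):                             # a few BPE-style merges
        corpus = merge_most_frequent_pair(corpus)

    # Fewer tokens in the sequence, but more types in the vocabulary.
    print(len(corpus), len(set(corpus)))

Each merge shortens the token sequence (fewer tokens) at the cost of a larger vocabulary (more types), which is the balance that word-level, subword-level, and LiB-style tokenizers negotiate in different ways.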
Publisher
John Benjamins Publishing Company
Cited by
2 articles.
1. The origin and function of external representations. Adaptive Behavior, 2024-06-21.
2. Introduction. International Journal of Chinese Linguistics, 2024-06-17.