Affiliation:
Rice University, Houston, USA
Abstract
Tokenization (also known as scanning or lexing) is a computational task with applications in the lexical analysis of programs during compilation and in data extraction and analysis for unstructured or semi-structured data (e.g., data represented in the JSON and CSV formats). We propose two algorithms for the tokenization problem that run in time linear in the length of the input text without requiring large amounts of memory. We also show that an optimized version of one of these algorithms compares favorably with prior approaches on practical tokenization workloads.
Publisher
Association for Computing Machinery (ACM)