Affiliation:
1. LegalOn Technologies Research, Tokyo, Japan
2. Center for Data-driven Science and Artificial Intelligence, Tohoku University, Sendai, Japan
Abstract
Multiple pattern matching in strings is a fundamental problem in text processing applications such as regular expressions and tokenization. This article studies efficient implementations of double-array Aho–Corasick automata (DAACs), data structures for performing multiple pattern matching quickly. The practical performance of DAACs can be improved by carefully designing the data structure, and many implementation techniques have been proposed thus far. However, no comprehensive description or experimental analysis of these techniques is available, so engineers face difficulties in implementing an efficient DAAC. In this article, we review implementation techniques for DAACs and provide a comprehensive description of them. We also propose several new techniques for further improvement. We conduct exhaustive experiments on real-world datasets and identify the combination of techniques that achieves the highest performance. This combination differs from those used in the most popular DAAC libraries, which demonstrates that their performance can be further enhanced. On the basis of our experimental analysis, we developed Daachorse, a new Rust library for fast multiple pattern matching using DAACs, released as open-source software at https://github.com/daac-tools/daachorse. Experiments demonstrate that Daachorse outperforms other AC-automaton implementations, indicating its suitability as a fast alternative for multiple pattern matching in many applications.
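For readers unfamiliar with the underlying automaton, the following is a minimal sketch of a textbook Aho–Corasick matcher in Rust. It is not Daachorse's API and does not use the double-array layout studied in the article: transitions are stored in hash maps for brevity. The names `AhoCorasick`, `Match`, and `find_overlapping` are hypothetical and chosen only for this illustration, and the sketch assumes all patterns are non-empty.

```rust
use std::collections::{HashMap, VecDeque};

/// One occurrence: pattern index and byte range [start, end) in the text.
#[derive(Debug, PartialEq, Eq)]
pub struct Match {
    pub pattern: usize,
    pub start: usize,
    pub end: usize,
}

/// Textbook Aho-Corasick automaton over bytes.
/// Illustration only: hash-map transitions, NOT a double array.
pub struct AhoCorasick {
    goto: Vec<HashMap<u8, usize>>, // trie edges of each state
    fail: Vec<usize>,              // failure link of each state
    out: Vec<Vec<usize>>,          // ids of patterns ending at each state
    lens: Vec<usize>,              // byte length of each pattern
}

impl AhoCorasick {
    /// Builds the automaton. Assumes every pattern is non-empty.
    pub fn new(patterns: &[&str]) -> Self {
        let mut goto: Vec<HashMap<u8, usize>> = vec![HashMap::new()];
        let mut out: Vec<Vec<usize>> = vec![Vec::new()];

        // Step 1: insert every pattern into a trie rooted at state 0.
        for (id, p) in patterns.iter().enumerate() {
            let mut s = 0;
            for &b in p.as_bytes() {
                let next = goto[s].get(&b).copied();
                s = match next {
                    Some(t) => t,
                    None => {
                        goto.push(HashMap::new());
                        out.push(Vec::new());
                        let t = goto.len() - 1;
                        goto[s].insert(b, t);
                        t
                    }
                };
            }
            out[s].push(id);
        }

        // Step 2: compute failure links breadth-first, so that shallower
        // states are finished before the states that point to them.
        let mut fail = vec![0; goto.len()];
        let mut queue: VecDeque<usize> = goto[0].values().copied().collect();
        while let Some(s) = queue.pop_front() {
            let edges: Vec<(u8, usize)> = goto[s].iter().map(|(&b, &t)| (b, t)).collect();
            for (b, t) in edges {
                // Walk the failure chain of s until some state has an edge on b.
                let mut f = fail[s];
                while f != 0 && !goto[f].contains_key(&b) {
                    f = fail[f];
                }
                fail[t] = goto[f].get(&b).copied().unwrap_or(0);
                // Patterns ending at the failure target also end at t.
                let inherited = out[fail[t]].clone();
                out[t].extend(inherited);
                queue.push_back(t);
            }
        }

        let lens = patterns.iter().map(|p| p.len()).collect();
        AhoCorasick { goto, fail, out, lens }
    }

    /// Reports all (possibly overlapping) occurrences in one pass over the text.
    pub fn find_overlapping(&self, text: &str) -> Vec<Match> {
        let mut found = Vec::new();
        let mut s = 0;
        for (i, &b) in text.as_bytes().iter().enumerate() {
            // On a mismatch, fall back along failure links.
            while s != 0 && !self.goto[s].contains_key(&b) {
                s = self.fail[s];
            }
            s = self.goto[s].get(&b).copied().unwrap_or(0);
            for &id in &self.out[s] {
                let end = i + 1;
                found.push(Match { pattern: id, start: end - self.lens[id], end });
            }
        }
        found
    }
}
```

With the patterns `he`, `she`, `his`, and `hers`, calling `find_overlapping("ushers")` on this sketch reports `she` at bytes 1..4, `he` at 2..4, and `hers` at 2..6. A double-array implementation replaces the per-state hash maps with compact arrays so that each transition becomes a constant-time array lookup; that layout and its tuning are the subject of the article and are not reproduced here.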
Cited by
2 articles.