Affiliation:
1. King Mongkut’s University of Technology North Bangkok, Faculty of Information Technology and Digital Innovation, Bangkok, Thailand
2. University of Hagen, Chair of Communication Networks, Hagen, Germany
Abstract
Word segmentation is a necessary step in many natural language processing tasks, especially for Thai, which is written without delimiters between words. Incorrect segmentation, however, severely degrades the final result. In this study, we propose two new brain-inspired methods based on Hawkins’ approach to address Thai word segmentation. Sparse Distributed Representations (SDRs) are used to model how the neocortex of the brain stores and transfers information. The first proposed method, THDICTSDR, improves the dictionary-based approach by using SDRs to learn the surrounding context and combining them with n-grams to select the correct word. The second method, THSDR, uses SDRs instead of a dictionary. Both methods are evaluated on the BEST2010 and LST20 standard word segmentation datasets and compared with longest matching, newmm, and Deepcut, the state-of-the-art deep learning approach. The results show that the accuracy and performance of the first method are significantly better than those of the other dictionary-based approaches. The first method achieves an F1-score of 95.60%, comparable to the state-of-the-art Deepcut F1-score of 96.34%. However, it provides a better F1-score of 96.78% when all vocabulary is learned. In addition, it achieves an F1-score of 99.48%, exceeding Deepcut's 97.65%, when all sentences are learned. The second method is tolerant to noise and provides better overall results than deep learning in all cases.
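As an illustration of the longest matching baseline and the F1 metric named in the abstract, the following is a minimal sketch and not the authors' implementation. The toy dictionary, the function names (`longest_matching`, `word_spans`, `f1_score`), and the choice to score F1 over exact word spans are all assumptions made for this example. The string "ตากลม" can be read as "ตา|กลม" or "ตาก|ลม", which shows why a greedy dictionary lookup without context can pick the wrong words.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    max_len = max(map(len, dictionary))
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            # Assumption: a character with no dictionary match becomes its own token.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


def word_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f1_score(predicted, gold):
    """Word-level F1: a predicted word counts only if both of its boundaries match the gold."""
    pred_spans, gold_spans = word_spans(predicted), word_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy dictionary; real systems use a full Thai lexicon.
toy_dictionary = {"ตา", "ตาก", "กลม", "ลม"}
gold = ["ตา", "กลม"]
predicted = longest_matching("ตากลม", toy_dictionary)
print(predicted, f1_score(predicted, gold))
```

In this toy case the greedy rule prefers "ตาก" over "ตา", so neither predicted word matches the gold segmentation. This is the kind of ambiguity that context-aware selection is meant to resolve.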
Subject
General Mathematics, General Medicine, General Neuroscience, General Computer Science