Abstract
How the token unit is defined within a sentence matters in natural language processing tasks such as text classification, machine translation, and generation. Many recent studies have used subword tokenization in language models such as BERT, KoBERT, and ALBERT. Although these language models achieve state-of-the-art results on various NLP tasks, it is not clear whether subword tokenization is the best token unit for Korean sentence embedding. We therefore performed sentence embedding based on word, morpheme, subword, and submorpheme units, respectively, for Korean sentiment analysis. We explored two sentence-representation methods: one that considers the order of tokens in a sentence and one that does not. By feeding sentences decomposed at each token unit into both representation methods, we constructed sentence embeddings under various tokenizations to find the most effective token unit for Korean sentence embedding. Our experiments confirmed the robustness of the subword unit to out-of-vocabulary (OOV) problems compared with other token units, the disadvantage of replacing whitespace with a special symbol in the sentiment analysis task, and that the optimal vocabulary size is 16K for both subword and submorpheme tokenization. Empirically, subwords tokenized with a 16K vocabulary and without whitespace replacement were the most effective for sentence embedding on the Korean sentiment analysis task.
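As an illustration of the subword setup the abstract describes, the sketch below trains a 16K-vocabulary subword tokenizer on Korean text. SentencePiece is assumed here because it is the standard subword toolkit for Korean models such as KoBERT; the abstract does not name the toolkit, and the corpus path, model prefix, and example sentence are hypothetical. The final lines show one way to undo the whitespace-replacement symbol, since the abstract reports that subwords without whitespace replacement performed best.

import sentencepiece as spm

# Train a 16K BPE subword model. During tokenization SentencePiece replaces
# whitespace with the meta symbol U+2581 ("▁").
spm.SentencePieceTrainer.train(
    input="korean_corpus.txt",      # hypothetical path: one sentence per line
    model_prefix="ko_subword_16k",  # hypothetical model name
    vocab_size=16000,               # the optimal size reported in the abstract
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="ko_subword_16k.model")

sentence = "이 영화 정말 재미있었다"  # "This movie was really fun"
pieces = sp.encode(sentence, out_type=str)
print(pieces)  # e.g. ['▁이', '▁영화', '▁정말', '▁재미', '있었다'] (model-dependent)

# The abstract reports that the best-performing variant did NOT replace
# whitespace with a symbol, so one reading is to strip the '▁' marker:
pieces_no_ws = [p.lstrip("\u2581") for p in pieces if p.lstrip("\u2581")]
print(pieces_no_ws)  # e.g. ['이', '영화', '정말', '재미', '있었다']

The resulting token sequences would then be fed to the sentence-representation methods (order-aware or order-agnostic) compared in the paper.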
Funder
The Ministry of the Republic of Korea and the National Research Foundation of Korea
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by
3 articles.