Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC

Published:2022-05-30 Issue:11 Volume:12 Page:5545
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Park Chanjun, Shim Midan, Eo Sugyeong, Lee Seolhwa, Seo Jaehyung, Moon Hyeonseok, Lim Heuiseok

Abstract

A machine translation (MT) system aims to translate text from a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora for Korean are relatively scarce compared to those for other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features on a dictionary basis. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Through a correlation analysis between LIWC features and NMT performance, our findings suggest directions for further research toward obtaining higher-quality parallel corpora.

Funder

Ministry of Science and ICT

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/12/11/5545/pdf

References (70 articles)

1. Understanding the societal impacts of machine translation: a critical review of the literature on medical and legal use cases

2. Neural machine translation by jointly learning to align and translate;Bahdanau;arXiv,2014

3. Cross-lingual language model pretraining;Lample;arXiv,2019

Cited by 4 articles.

1. Identifying Reddit Users at a High Risk of Suicide and Their Linguistic Features During the COVID-19 Pandemic: Growth-Based Trajectory Model;Journal of Medical Internet Research;2024-08-08

2. Doubts on the reliability of parallel corpus filtering;Expert Systems with Applications;2023-12

3. Playing to Save Sisters: How Female Gaming Communities Foster Social Support within Different Cultural Contexts;Journal of Broadcasting & Electronic Media;2023-09-06

4. Uncovering the Risks and Drawbacks Associated With the Use of Synthetic Data for Grammatical Error Correction;IEEE Access;2023