
Segmentation and alignment of parallel text for statistical machine translation

Published:2006-07-06 Issue:3 Volume:13 Page:235-260
ISSN:1351-3249
Container-title:Natural Language Engineering
language:en
Short-container-title:Nat. Lang. Eng.

Author:

Yonggang Deng, Shankar Kumar, William Byrne

Abstract

We address the problem of extracting bilingual chunk pairs from parallel text to create training sets for statistical machine translation. We formulate the problem in terms of a stochastic generative process over text translation pairs, and derive two different alignment procedures based on the underlying alignment model. The first procedure is a now-standard dynamic programming alignment model which we use to generate an initial coarse alignment of the parallel text. The second procedure is a divisive clustering parallel text alignment procedure which we use to refine the first-pass alignments. This latter procedure is novel in that it permits the segmentation of the parallel text into sub-sentence units which are allowed to be reordered to improve the chunk alignment. The quality of chunk pairs is measured by the performance of machine translation systems trained from them. We show practical benefits of divisive clustering as well as how system performance can be improved by exploiting portions of the parallel text that otherwise would have to be discarded. We also show that chunk alignment as a first step in word alignment can significantly reduce word alignment error rate.
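As a rough illustration of the kind of monotone dynamic programming alignment the abstract calls "now-standard", the sketch below aligns two sentence lists using a simple character-length cost. It is not the authors' model, which is a probabilistic generative model; the bead shapes, the penalties in `BEADS`, and the function name `align` are all assumptions made for this example.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences, target sentences) -> penalty.
# Shapes and penalty values are hypothetical, chosen only for illustration.
BEADS = {(1, 1): 0.0, (1, 0): 5.0, (0, 1): 5.0, (2, 1): 2.0, (1, 2): 2.0}


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Monotone alignment minimising length mismatch plus bead penalties."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(s_len - t_len) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Backward pass: recover the chunk pairs from the stored back-pointers.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Calling `align(["aaaa", "bb", "cc", "dddd"], ["aaaa", "bbcc", "dddd"])` groups the two middle source sentences into one chunk aligned with the single middle target sentence. Because this search is strictly monotone, it cannot reorder chunks, which is the limitation the abstract's divisive clustering procedure is described as addressing.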

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software

Cited by 8 articles.

1. Divide and Conquer Approach to Long Genomic Sequence Alignment;2021 11th International Conference on Computer Engineering and Knowledge (ICCKE);2021-10-28

2. Input Use Efficiency in Rice–Wheat Cropping Systems to Manage the Footprints for Food and Environmental Security;Input Use Efficiency for Food and Environmental Security;2021

3. Statistical Approach to Noisy-Parallel and Comparable Corpora Filtering for the Extraction of Bi-lingual Equivalent Data at Sentence-Level;Advances in Intelligent Systems and Computing;2018

4. Sequence Alignment as a Set Partitioning Problem;Journal of Natural Language Processing;2016

5. Towards non-monotonic sentence alignment;Information Sciences;2015-12