Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Published:2005-12 Issue:4 Volume:31 Page:477-504
ISSN:0891-2017
Container-title:Computational Linguistics
language:en
Short-container-title:Computational Linguistics

Author:

Munteanu Dragos Stefan¹,Marcu Daniel¹

Affiliation:

1. Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292

Abstract

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
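The classification step described in the abstract can be illustrated with a minimal sketch. A binary maximum entropy model is mathematically equivalent to logistic regression, so the example below trains one by plain gradient ascent. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy French-English lexicon `LEXICON`, the two features (lexical coverage and length ratio), the training pairs, and the function names.

```python
import math

# Hypothetical toy bilingual lexicon (source word -> possible target words).
# A real system would use a dictionary learned from a seed parallel corpus.
LEXICON = {
    "maison": {"house", "home"},
    "rouge": {"red"},
    "chat": {"cat"},
    "le": {"the"},
    "la": {"the"},
    "est": {"is"},
}


def features(src, tgt):
    """Map a sentence pair to [bias, lexical coverage, length ratio].

    These features are illustrative assumptions, not the paper's feature set.
    """
    s, t = src.lower().split(), tgt.lower().split()
    if not s or not t:
        return [1.0, 0.0, 0.0]
    t_set = set(t)
    # Fraction of source words with at least one lexicon translation in tgt.
    covered = sum(1 for w in s if LEXICON.get(w, set()) & t_set)
    coverage = covered / len(s)
    length_ratio = min(len(s), len(t)) / max(len(s), len(t))
    return [1.0, coverage, length_ratio]


def prob_parallel(weights, x):
    """Binary maxent (logistic) probability that the pair is parallel."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, epochs=2000, lr=0.5):
    """Fit the weights by batch gradient ascent on the log-likelihood."""
    xs = [features(s, t) for s, t in pairs]
    weights = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(xs, labels):
            err = y - prob_parallel(weights, x)
            for j, xj in enumerate(x):
                grad[j] += err * xj
        weights = [w + lr * g / len(xs) for w, g in zip(weights, grad)]
    return weights


def score(weights, src, tgt):
    """Convenience wrapper: probability that (src, tgt) are translations."""
    return prob_parallel(weights, features(src, tgt))


# Tiny made-up training set: 1 = parallel, 0 = not parallel.
PAIRS = [
    ("la maison est rouge", "the house is red"),
    ("le chat est rouge", "the cat is red"),
    ("la maison est rouge", "stock markets fell sharply on monday"),
    ("le chat", "parliament voted on the new budget proposal today"),
]
LABELS = [1, 1, 0, 0]

weights = train(PAIRS, LABELS)
```

Under these assumptions, calling `score(weights, "la maison", "the house")` yields a probability above 0.5, while an unrelated pair scores below it. In an extraction setting, candidate pairs from the non-parallel corpus whose score clears a chosen threshold would be kept as parallel data.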

Publisher

MIT Press - Journals

Subject

Artificial Intelligence,Computer Science Applications,Linguistics and Language,Language and Linguistics

Link

https://www.mitpressjournals.org/doi/pdf/10.1162/089120105775299168

Reference: 6 articles.

1. A Systematic Comparison of Various Statistical Alignment Models

Cited by 111 articles.

1. Building English – Punjabi Aligned Parallel Corpora of Nouns from Comparable Corpora;Applied Computer Systems;2023-12-01

2. LegalNERo: A linked corpus for named entity recognition in the Romanian legal domain;Semantic Web;2023-06-05

3. Conclusions and Future Research;Building and Using Comparable Corpora for Multilingual Natural Language Processing;2023

4. Extraction of Parallel Sentences;Building and Using Comparable Corpora for Multilingual Natural Language Processing;2023

5. Tailoring and evaluating the Wikipedia for in-domain comparable corpora extraction;Knowledge and Information Systems;2022-11-01