The Web as a Parallel Corpus

Published:2003-09 Issue:3 Volume:29 Page:349-380
ISSN:0891-2017
Container-title:Computational Linguistics
language:en

Author:

Resnik Philip¹, Smith Noah A.²

Affiliation:

1. Department of Linguistics and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742.

2. Department of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218.

Abstract

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.

Publisher

MIT Press - Journals

Subject

Artificial Intelligence, Computer Science Applications, Linguistics and Language, Language and Linguistics

Link

https://www.mitpressjournals.org/doi/pdf/10.1162/089120103322711578

References: 7 articles.

1. A hierarchical Dirichlet language model

2. Models of Translational Equivalence among Words

Cited by 176 articles.

1. Emerging resources, enduring challenges: a comprehensive study of Kashmiri parallel corpus; AI & SOCIETY; 2024-06-14

2. Loanword identification based on web resources: A case study on wikipedia; Computer Speech & Language; 2023-06

3. Parallel Corpus Creation for NMT using Web Scraping and Filtering; 2023 4th International Conference on Computing and Communication Systems (I3CS); 2023-03-16

4. Addressing the Issue of Unavailability of Parallel Corpus Incorporating Monolingual Corpus on PBSMT System for English-Manipuri Translation; Computational Linguistics and Intelligent Text Processing; 2023

5. Building Comparable Corpora; Building and Using Comparable Corpora for Multilingual Natural Language Processing; 2023