Building a multi-domain comparable corpus using a learning to rank method

Published: 2016-06-15 Issue: 4 Volume: 22 Pages: 627-653
ISSN: 1351-3249
Container-title: Natural Language Engineering
Language: en
Short-container-title: Nat. Lang. Eng.

Authors:

Razieh Rahimi, Azadeh Shakery, Javid Dadashkarimi, Mozhdeh Ariannezhad, Mostafa Dehghani, Hossein Nasr Esfahani

Abstract

Comparable corpora are key translation resources for both languages and domains with limited linguistic resources. Existing approaches for building comparable corpora are mostly based on ranking candidate documents in the target language for each source document using a cross-lingual retrieval model. These approaches also exploit other evidence of document similarity, such as proper names and publication dates, to build more reliable alignments. However, the importance of each piece of evidence in the scores of candidate target documents is determined heuristically. In this paper, we employ a learning to rank method for ranking candidate target documents with respect to each source document. The ranking model is constructed by defining each piece of evidence for the similarity of bilingual documents as a feature whose weight is learned automatically. Learning feature weights can significantly improve the quality of alignments, because the reliability of features depends on the characteristics of both the source and target languages of a comparable corpus. We also propose a method to generate appropriate training data for the task of building comparable corpora. We employed the proposed learning-based approach to build a multi-domain English–Persian comparable corpus which covers twelve different domains obtained from the Open Directory Project. Experimental results show that the created alignments have high degrees of comparability. Comparison with existing approaches for building comparable corpora shows that our learning-based approach improves both the quality and the coverage of alignments.
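The idea of treating each piece of similarity evidence as a feature with a learned weight can be illustrated with a small sketch. This record does not specify the paper's feature set, learning algorithm, or training data, so everything below is an assumption made for illustration: the three features (a cross-lingual retrieval score, proper-name overlap, publication-date proximity), the toy values, and the simple pairwise hinge-style update are hypothetical and are not the authors' actual method.

```python
from typing import List, Tuple

# A candidate target document: (feature vector, relevance label).
# Hypothetical features: [retrieval_score, name_overlap, date_proximity].
Candidate = Tuple[List[float], int]


def score(weights: List[float], features: List[float]) -> float:
    """Linear ranking score: weighted sum of the evidence features."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(queries: List[List[Candidate]],
                   epochs: int = 50, lr: float = 0.1) -> List[float]:
    """Learn feature weights from preference pairs within each source document.

    For every pair where candidate a is more relevant than candidate b,
    push score(a) above score(b) by a margin of 1 (hinge-style update).
    """
    dim = len(queries[0][0][0])
    weights = [0.0] * dim
    for _ in range(epochs):
        for candidates in queries:
            for feats_a, rel_a in candidates:
                for feats_b, rel_b in candidates:
                    if rel_a <= rel_b:
                        continue  # only pairs where a should outrank b
                    margin = score(weights, feats_a) - score(weights, feats_b)
                    if margin < 1.0:
                        for i in range(dim):
                            weights[i] += lr * (feats_a[i] - feats_b[i])
    return weights


def rank(weights: List[float], feature_vectors: List[List[float]]) -> List[int]:
    """Return candidate indices ordered from best to worst score."""
    return sorted(range(len(feature_vectors)),
                  key=lambda i: score(weights, feature_vectors[i]),
                  reverse=True)


# Toy training data (invented): one list of candidates per source document.
training = [
    [([0.90, 0.8, 1.0], 1), ([0.70, 0.1, 0.2], 0), ([0.95, 0.0, 0.1], 0)],
    [([0.50, 0.9, 0.9], 1), ([0.60, 0.2, 0.0], 0)],
]

weights = train_pairwise(training)
for candidates in training:
    order = rank(weights, [feats for feats, _ in candidates])
    print("ranking:", order)
```

In this toy data the retrieval score alone would misrank some candidates, which is the situation where learning the relative weight of the other evidence, rather than fixing it heuristically, changes the resulting alignment.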

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software

References: 46 articles.

1. Braschler M. and Schäuble P. 1998. Multilingual information retrieval based on document alignment techniques. In Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, ECDL'98, London, UK: Springer-Verlag, pp. 183–197.

2. Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework

3. Munteanu D. S. and Marcu D. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 81–88.

4. Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

5. Gaussier E., Renders J.-M., Matveeva I., Goutte C., and Déjean H. 2004. A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL'04, Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 527–534.

Cited by 2 articles.

1. Matching Graph, a Method for Extracting Parallel Information from Comparable Corpora;ACM Transactions on Asian and Low-Resource Language Information Processing;2020-01-09

2. SS4MCT: A Statistical Stemmer for Morphologically Complex Texts;Lecture Notes in Computer Science;2016