Document Alignment for Generation of English-Punjabi Comparable Corpora from Wikipedia

Published:2020-01 Issue:1 Volume:12 Page:42-51
ISSN:1937-9633
Container-title:International Journal of E-Adoption
language:en

Author:

Goyal Vishal¹, Kumar Ajit², Lehal Manpreet Singh³

Affiliation:

1. Punjabi University, India

2. Multani Mal Modi College, India

3. Punjabi University

Abstract

Comparable corpora serve as an alternative to parallel corpora for languages in which parallel corpora are scarce. Models trained on comparable corpora perform less well than those trained on parallel corpora; however, comparable corpora go a long way toward compensating for the lack of parallel data in machine translation. In this article, the authors explore Wikipedia as a potential source and describe the process of aligning documents, which will then be used to extract parallel data. The parallel data thus extracted will help improve the performance of statistical machine translation.

Publisher

IGI Global

Subject

Computer Networks and Communications, Information Systems

References (22 articles)

1. On the use of comparable corpora to improve SMT performance

2. Adafre, S. F., & De Rijke, M. (2006). Finding Similar Sentences across Multiple Languages in Wikipedia. In Proceedings of the Workshop on NEW TEXT Wikis and Blogs and Other Dynamic Text Sources. Academic Press.

3. WEB MINING FOR AN AMHARIC - ENGLISH BILINGUAL CORPUS

4. Bouamor, H., & Sajjad, H. (2018). H2@BUCC18: Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings. In LREC. Academic Press.

5. IEPAD

Cited by 3 articles.

1. A Privacy-Preserving Multilingual Comparable Corpus Construction Method in Internet of Things;Mathematics;2024-02-17

2. A Construction method of Multilingual Comparable Corpus in the background of Artificial Intelligence and Internet of Things;2023 IEEE International Conferences on Internet of Things (iThings) and IEEE Green Computing & Communications (GreenCom) and IEEE Cyber, Physical & Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics);2023-12-17

3. Manipuri–English comparable corpus for cross-lingual studies;Language Resources and Evaluation;2022-02-23