Document Alignment for Generation of English-Punjabi Comparable Corpora from Wikipedia

Author:

Goyal Vishal¹, Kumar Ajit², Lehal Manpreet Singh³

Affiliation:

1. Punjabi University, India

2. Multani Mal Modi College, India

3. Punjabi University

Abstract

Comparable corpora serve as an alternative to parallel corpora for languages where parallel corpora are scarce. Models trained on comparable corpora are less effective than those trained on parallel corpora; however, comparable corpora still contribute considerably to machine translation. In this article, the authors explore Wikipedia as a potential source and delineate the process of aligning documents, which will then be used for the extraction of parallel data. The parallel data thus extracted will help to enhance the performance of statistical machine translation.

Publisher

IGI Global

Subject

Computer Networks and Communications, Information Systems

Cited by 3 articles.

1. A Privacy-Preserving Multilingual Comparable Corpus Construction Method in Internet of Things;Mathematics;2024-02-17

2. A Construction method of Multilingual Comparable Corpus in the background of Artificial Intelligence and Internet of Things;2023 IEEE International Conferences on Internet of Things (iThings) and IEEE Green Computing & Communications (GreenCom) and IEEE Cyber, Physical & Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics);2023-12-17

3. Manipuri–English comparable corpus for cross-lingual studies;Language Resources and Evaluation;2022-02-23
