Exploring Web-based Translation Resources Applied to Hindi-English Cross-Lingual Information Retrieval

Author:

Sharma Vijay Kumar, Mittal Namita [1], Vidyarthi Ankit [2], Gupta Deepak [3]

Affiliation:

1. Malaviya National Institute of Technology Department of Computer Science & Engineering, India

2. Jaypee Institute of Information Technology Noida Department of CSE&IT, India

3. Maharaja Agrasen Institute of Technology, Delhi and Chandigarh University, Mohali Department of CSE and Research Advisor, UCRD, Mohali, India

Abstract

Internet users encounter a multilingual web, but much of it remains inaccessible to them because they communicate in their regional language; retrieving documents across this language gap is the task of Cross-Lingual Information Retrieval (CLIR). In CLIR, a translation technique is used to translate user queries into the language of the target documents. Conventional translation techniques rely on either a manual dictionary or a parallel corpus, while the currently popular Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) techniques are trained on a parallel corpus. NMT is not yet mature for Hindi-English translation; according to the literature, SMT performs better than NMT. SMT provides static translations because the vocabulary of the available parallel corpora is limited, so it may not translate missing or unseen words. The web, by contrast, is a dynamic resource that many users update at the same time, and it may supply translations for such missing or unseen words. The web has therefore been used effectively for technologically developed languages such as English, German, Spanish, Russian, and Chinese. In this paper, translation techniques based on different web resources such as Wikipedia, Hindi WordNet and Indo WordNet, ConceptNet, and online dictionaries are proposed and applied to Hindi-English CLIR. The Wikipedia-based translation approach incorporates three modules (exactly matched, partially matched, and disambiguation) to address the issues of wrong inter-wiki links, partially matched terms, and ambiguous articles. The Hindi WordNet and Indo WordNet attribute "English synset" and the ConceptNet attributes "Related term" and "Synonymy" are used to obtain translations. Further, WordNet path similarity is used to disambiguate translations. Various online dictionaries are available that return multiple relevant and irrelevant translations. The proposed approaches are compared with SMT; the Wikipedia-based approach achieves a mean average precision approximately similar to that of SMT.
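The comparison in the abstract is stated in terms of mean average precision (MAP). As a reading aid, the following is a minimal sketch of the standard MAP computation over ranked retrieval results. The function names, document identifiers, and toy data are illustrative assumptions and are not taken from the paper, which does not describe its evaluation tooling here.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    ranked   -- document ids in the order the system returned them
    relevant -- ids judged relevant for the query
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant documents.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in qrels.items()
    )
    return total / len(qrels)


# Toy data (hypothetical ids, not from the paper).
toy_runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
toy_map = mean_average_precision(toy_runs, toy_qrels)
```

In the toy data, query q1 scores (1/1 + 2/3) / 2 and query q2 scores 1/2, so the mean is 2/3. Two translation approaches can be compared by running the same queries through each and comparing the resulting MAP values.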

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science


