A dataset to evaluate Hindi Word Embeddings

Author:

Soni Vimal Kumar, Gopalani Dinesh, Govil M C

Abstract

The current trend in Natural Language Processing (NLP) is to fetch data with online crawling methods and then apply shallow or deep learning methods to this data to develop models for the respective tasks. Word vectors generated with such methods are applied to several NLP challenges and are evaluated on the word similarity task. For English, both large amounts of data and multiple datasets are available to evaluate the performance of the developed models. However, the situation is different for Indian languages, and for Hindi in particular. To address this gap, we propose a dataset for checking word similarity in Hindi. The construction process and the subsequent annotation process are described in detail. To construct this dataset, 353 word pairs from the most popular English dataset are first selected and translated. The translations are verified by Hindi experts. Finally, the word pairs are annotated independently by 11 native Hindi speakers, who were selected according to multiple criteria. The final dataset has been evaluated on CBOW and Skip-gram models.
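The abstract does not say how the models are scored against the dataset. A common practice for word similarity benchmarks is to compute the cosine similarity of each word pair's vectors and report the Spearman rank correlation with the human scores. The sketch below illustrates that practice only, and it is an assumption rather than the authors' confirmed procedure. The function names, the placeholder words, the toy vectors, and the toy scores are all hypothetical, and the choice to skip out-of-vocabulary pairs is likewise an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(pairs, vectors):
    """Score a model on (word1, word2, human_score) triples.

    Pairs containing a word without a vector are skipped (an assumption).
    Returns the Spearman correlation and the number of pairs used.
    """
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            human.append(score)
            model.append(cosine(vectors[w1], vectors[w2]))
    return spearman(human, model), len(human)


# Toy data for illustration only; not taken from the dataset or the paper.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.5, 0.5],
}
toy_pairs = [
    ("a", "b", 9.0),
    ("a", "d", 5.0),
    ("a", "c", 1.0),
    ("a", "missing", 7.0),  # skipped: "missing" has no vector
]

rho, used = evaluate(toy_pairs, toy_vectors)
print(f"Spearman correlation: {rho:.3f} over {used} pairs")
```

In practice, `toy_vectors` would be replaced by vectors from a trained CBOW or Skip-gram model, and each human score would typically be an aggregate (for example, the mean) of the individual annotators' ratings for that pair.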

Publisher

IOP Publishing

Subject

General Medicine

References: 18 articles.
