Improving the state-of-the-art in Thai semantic similarity using distributional semantics and ontological information

Author:

Netisopakul Ponrudee, Wohlgenannt Gerhard, Pulich Aleksei, Hlaing Zar Zar

Abstract

Research into semantic similarity has a long history in lexical semantics, and it has applications in many natural language processing (NLP) tasks such as word sense disambiguation and machine translation. The task of calculating semantic similarity is usually presented in the form of datasets that contain word pairs and a human-assigned similarity score for each pair. Algorithms are then evaluated by their ability to approximate the gold-standard similarity scores. Many such datasets, with different characteristics, have been created for the English language. Recently, four of them were translated into Thai-language versions, namely WordSim-353, SimLex-999, SemEval-2017-500, and R&G-65. Given those four datasets, in this work we aim to improve on the previous baseline evaluations for Thai semantic similarity and to address the challenges of unsegmented Asian languages, particularly the high fraction of out-of-vocabulary (OOV) dataset terms. To this end, we apply and integrate different strategies to compute similarity, including traditional word-level embeddings, subword-unit embeddings, and ontological or hybrid sources such as WordNet and ConceptNet. With our best model, which combines self-trained fastText subword embeddings with ConceptNet Numberbatch, we raise the state of the art, measured as the harmonic mean of Pearson and Spearman ρ, by a large margin: from 0.356 to 0.688 for TH-WordSim-353, from 0.286 to 0.769 for TH-SemEval-500, from 0.397 to 0.717 for TH-SimLex-999, and from 0.505 to 0.901 for TWS-65.
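The evaluation measure named in the abstract, the harmonic mean of the Pearson and Spearman correlations between gold-standard and predicted scores, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names (`pearson`, `average_ranks`, `spearman`, `harmonic_mean`, `evaluate`) and the example scores are hypothetical, and Spearman's ρ is assumed to be computed as the Pearson correlation of average ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, assumed here to be Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_mean(a, b):
    """Harmonic mean of two values (intended for positive correlations)."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def evaluate(gold, predicted):
    """Return (Pearson r, Spearman rho, harmonic mean of the two)."""
    r = pearson(gold, predicted)
    rho = spearman(gold, predicted)
    return r, rho, harmonic_mean(r, rho)


# Hypothetical scores for five word pairs (not taken from any dataset).
gold_scores = [9.1, 7.4, 5.0, 2.3, 0.8]
model_scores = [0.82, 0.75, 0.40, 0.35, 0.05]
r, rho, combined = evaluate(gold_scores, model_scores)
print(f"Pearson={r:.3f} Spearman={rho:.3f} harmonic mean={combined:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model scores well on this combined measure only if both its linear agreement and its rank agreement with the human judgments are high.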

Funder

King Mongkut’s Institute of Technology Ladkrabang

the ITMO Fellowship and Professorship Program

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

Cited by 4 articles.