Word Sense Clustering and Clusterability

Author:

Diana McCarthy (1), Marianna Apidianaki (2), Katrin Erk (3)

Affiliation:

1. University of Cambridge

2. LIMSI, CNRS, Université Paris-Saclay

3. University of Texas at Austin

Abstract

Word sense disambiguation and the related field of automated word sense induction traditionally assume that the occurrences of a lemma can be partitioned into senses. But this seems to be a much easier task for some lemmas than others. Our work builds on recent work that proposes describing word meaning in a graded fashion rather than through a strict partition into senses; in this article we argue that not all lemmas may need the more complex graded analysis, depending on their partitionability. Although there is plenty of evidence from previous studies and from the linguistics literature that there is a spectrum of partitionability of word meanings, this is the first attempt to measure the phenomenon and to couple the machine learning literature on clusterability with word usage data used in computational linguistics. We propose to operationalize partitionability as clusterability, a measure of how easy the occurrences of a lemma are to cluster. We test two ways of measuring clusterability: (1) existing measures from the machine learning literature that aim to measure the goodness of optimal k-means clusterings, and (2) the idea that if a lemma is more clusterable, two clusterings based on two different “views” of the same data points will be more congruent. The two views that we use are two different sets of manually constructed lexical substitutes for the target lemma, on the one hand monolingual paraphrases, and on the other hand translations. We apply automatic clustering to the manual annotations. We use manual annotations because we want the representations of the instances that we cluster to be as informative and “clean” as possible. We show that when we control for polysemy, our measures of clusterability tend to correlate with partitionability, in particular some of the type-(1) clusterability measures, and that these measures outperform a baseline that relies on the amount of overlap in a soft clustering.
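The second idea in the abstract, that two clusterings of the same occurrences built from different "views" should agree more when a lemma is more clusterable, can be illustrated with a small sketch. The abstract does not say which congruence measure is used, so the plain Rand index below is only a stand-in chosen for illustration. The function name `rand_index` and the toy cluster labels are hypothetical and are not taken from the article's data.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of instance pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two
    instances in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same instances")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two instances")
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical cluster labels for six occurrences of one lemma.
# View 1: clustering derived from monolingual paraphrases (assumed).
paraphrase_view = [0, 0, 0, 1, 1, 1]
# View 2: clusterings derived from translations (assumed), one that
# matches view 1 up to label names and one that cuts across it.
translation_view_clear = [1, 1, 1, 0, 0, 0]
translation_view_fuzzy = [0, 1, 0, 1, 0, 1]

print(rand_index(paraphrase_view, translation_view_clear))
print(rand_index(paraphrase_view, translation_view_fuzzy))
```

In this toy setup the first comparison yields full agreement, because the two views induce the same partition under different label names, while the second yields a lower score. A chance-corrected measure could be substituted without changing the overall shape of the comparison.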

Publisher

MIT Press - Journals

Subject

Artificial Intelligence, Computer Science Applications, Linguistics and Language, Language and Linguistics

