Khmer Sentiment Lexicon Based on PU Learning and Label Propagation Algorithm

Author:

Li Chao (1), Yan Xin (1), Xu Guangyi (2), Deng Zhongying (3), Mo Yuanyuan (4)

Affiliation:

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Yunnan Key Laboratory of Artificial Intelligence, Kunming, China

2. Yunnan Nantian Electronic Information Industry Co., Ltd., Yunnan, China

3. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Yunnan, China

4. School of Southeast & South Asia Languages and Culture, Yunnan Minzu University, Yunnan, China

Abstract

A sentiment lexicon is an important tool for natural language processing. Besides determining the sentiment polarity of words or phrases, it can support attribute-level, sentence-level, and text-level sentiment analysis. Because labeled data and corpora for the Khmer language are scarce, and most sentiment lexicon resources are for English, this paper proposes a method for constructing a Khmer sentiment lexicon based on Positive-Unlabeled learning (PU learning) and the label propagation algorithm. Sentiment words are first extracted from a corpus using the Spy technique of PU learning. The main idea is to purify the set of N-class examples, train an MLP classifier, and, over the iterations, continually delete spy words and increase the number of P-class words. The sentiment polarity of the candidate words is then determined. Treating this step as a problem of calculating each candidate word's probability distribution, a graph model is constructed from a small number of labeled sentiment words together with the candidate words. The contextual information of the candidate words is used to construct a simple supplementary graph model of the set of sentiment words through word co-occurrence and triangulation, which strengthens the correlation between data items. The label propagation algorithm then determines the sentiment polarity of the candidate words. Experimental results show that the proposed method can construct a Khmer sentiment lexicon from a small amount of labeled data and a small corpus, without requiring excessive manual labeling.
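
To make the polarity step more concrete, the following is a minimal sketch of label propagation on a small word graph. It is an illustration under stated assumptions, not the paper's implementation: the function name `propagate_labels`, the five placeholder words, and the edge weights (imagined here as co-occurrence counts) are all hypothetical, and the update rule is a common row-normalized variant with clamped seed labels. The abstract does not specify how the paper's graph, including the supplementary co-occurrence and triangulation graph, is actually weighted.

```python
def propagate_labels(weights, seeds, n_classes=2, n_iter=100):
    """Spread seed labels over a weighted graph.

    weights: symmetric n x n list of non-negative edge weights (assumed).
    seeds:   dict mapping node index -> class index for labeled words.
    Returns one probability distribution over classes per node.
    """
    n = len(weights)
    # Unlabeled nodes start uniform; seed nodes start one-hot.
    dist = [[1.0 / n_classes] * n_classes for _ in range(n)]
    for node, cls in seeds.items():
        dist[node] = [1.0 if c == cls else 0.0 for c in range(n_classes)]

    for _ in range(n_iter):
        new_dist = []
        for i in range(n):
            total = sum(weights[i])
            if i in seeds or total == 0:
                # Clamp seed labels; leave isolated nodes unchanged.
                new_dist.append(dist[i])
                continue
            # Weighted average of the neighbors' current distributions.
            new_dist.append([
                sum(weights[i][j] * dist[j][c] for j in range(n)) / total
                for c in range(n_classes)
            ])
        dist = new_dist
    return dist


# Hypothetical toy graph: two labeled seeds and three candidate words.
words = ["seed_pos", "seed_neg", "cand_a", "cand_b", "cand_c"]
POSITIVE, NEGATIVE = 0, 1
weights = [
    [0, 0, 3, 0, 0],  # seed_pos -- cand_a
    [0, 0, 0, 0, 3],  # seed_neg -- cand_c
    [3, 0, 0, 2, 0],  # cand_a  -- seed_pos, cand_b
    [0, 0, 2, 0, 1],  # cand_b  -- cand_a, cand_c
    [0, 3, 0, 1, 0],  # cand_c  -- seed_neg, cand_b
]
result = propagate_labels(weights, seeds={0: POSITIVE, 1: NEGATIVE})
polarity = {
    word: ("positive" if probs[POSITIVE] >= probs[NEGATIVE] else "negative")
    for word, probs in zip(words, result)
}
print(polarity)
```

In this sketch each candidate word ends up with a distribution over the two polarities, and the larger entry is taken as its label. Clamping the seeds on every pass keeps the small labeled set from being washed out by the much larger set of unlabeled candidates.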

Funder

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science
