Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF-IDF Method

Author:

Lan Fei¹

Affiliation:

1. School of Electronics and Internet of Things, Chongqing College of Electronic Engineering, Chongqing 400000, China

Abstract

TF-IDF (term frequency-inverse document frequency) is a traditional, statistics-based method for calculating text similarity. Because TF-IDF ignores the semantic information of words, it cannot accurately reflect the similarity between texts. Semantically enhanced methods, in turn, discriminate poorly between text documents, because extending the vectors with semantically similar terms aggravates the curse of dimensionality. To address this problem, this paper proposes a hybrid method that combines semantic understanding with TF-IDF to calculate text similarity. Building on the term similarity weighting tree (TSWT) data structure and the definition of semantic similarity from HowNet, the paper first discusses text preprocessing and filtering, and then uses the semantic information of the key terms, that is, the features whose weight exceeds a given threshold, to calculate the similarity of text documents. Experimental results show that the hybrid method outperforms both pure TF-IDF and the purely semantic method in accuracy, recall, and F1 measure under different K-means clustering methods.
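The pipeline outlined in the abstract (weight terms with TF-IDF, keep only features above a threshold, then compare documents using term-level semantic similarity) can be illustrated with a minimal Python sketch. This is not the paper's TSWT algorithm, whose details the abstract does not give. The smoothed IDF variant, the soft-cosine combination, the function names, and the toy table `TOY_TERM_SIM` (a stand-in for HowNet similarities) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical term-to-term similarities standing in for HowNet scores.
TOY_TERM_SIM = {("automobile", "car"): 0.9, ("engine", "motor"): 0.8}


def term_sim(a, b):
    """Symmetric toy semantic similarity: 1.0 for identical terms, 0.0 if unknown."""
    if a == b:
        return 1.0
    return TOY_TERM_SIM.get(tuple(sorted((a, b))), 0.0)


def exact_match(a, b):
    """Identity similarity; with it, soft_cosine reduces to plain cosine."""
    return 1.0 if a == b else 0.0


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / doc_length and a smoothed idf = ln((1 + N) / (1 + df)) + 1,
    one common variant (an assumption, not necessarily the paper's formula).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return weights


def filter_terms(weights, threshold):
    """Keep only the features whose weight exceeds the given threshold."""
    return {t: w for t, w in weights.items() if w > threshold}


def _bilinear(wa, wb, sim):
    """Sum of w_a * w_b * sim(a, b) over all term pairs."""
    return sum(x * y * sim(a, b) for a, x in wa.items() for b, y in wb.items())


def soft_cosine(wa, wb, sim=term_sim):
    """Cosine-style similarity in which related terms also contribute."""
    denom = math.sqrt(_bilinear(wa, wa, sim) * _bilinear(wb, wb, sim))
    return _bilinear(wa, wb, sim) / denom if denom > 0 else 0.0


docs = [
    ["car", "engine", "road"],
    ["automobile", "motor", "road"],
    ["banana", "fruit", "sweet"],
]
weights = tfidf(docs)
plain = soft_cosine(weights[0], weights[1], exact_match)   # TF-IDF only
hybrid = soft_cosine(weights[0], weights[1], term_sim)     # TF-IDF + semantics
print(f"plain={plain:.3f} hybrid={hybrid:.3f}")
```

In this toy corpus the first two documents share only the word "road", so the plain TF-IDF cosine is low, while the semantic term matches raise the hybrid score. Filtering with `filter_terms` before comparison keeps the number of term pairs small, which is the motivation the abstract gives for thresholding.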

Funder

Chongqing Science and Technology Commission of China

Publisher

Hindawi Limited

Subject

General Computer Science
