Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding

Published:2022-07-07 Volume:8 Page:e1024
ISSN:2376-5992
Container-title:PeerJ Computer Science
language:en

Author:

Sarwar Talha Bin, Noor Noorhuzaimi Mohd, Saef Ullah Miah M.

Abstract

A textual data processing task that involves the automatic extraction of relevant and salient keyphrases from a document that expresses all the important concepts of the document is called keyphrase extraction. Due to technological advancements, the amount of textual information on the Internet is rapidly increasing as a lot of textual information is processed online in various domains such as offices, news portals, or for research purposes. Given the exponential increase of news articles on the Internet, manually searching for similar news articles by reading the entire news content that matches the user’s interests has become a time-consuming and tedious task. Therefore, automatically finding similar news articles can be a significant task in text processing. In this context, keyphrase extraction algorithms can extract information from news articles. However, selecting the most appropriate algorithm is also a problem. Therefore, this study analyzes various supervised and unsupervised keyphrase extraction algorithms, namely KEA, KP-Miner, YAKE, MultipartiteRank, TopicRank, and TeKET, which are used to extract keyphrases from news articles. The extracted keyphrases are used to compute lexical and semantic similarity to find similar news articles. The lexical similarity is calculated using the Cosine and Jaccard similarity techniques. In addition, semantic similarity is calculated using a word embedding technique called Word2Vec in combination with the Cosine similarity measure. The experimental results show that the KP-Miner keyphrase extraction algorithm, together with the Cosine similarity calculation using Word2Vec (Cosine-Word2Vec), outperforms the other combinations of keyphrase extraction algorithms and similarity calculation techniques to find similar news articles. 
The similar articles identified using KP-Miner and the Cosine similarity measure with Word2Vec appear to be relevant to a particular news article and thus show satisfactory performance, with a Normalized Discounted Cumulative Gain (NDCG) value of 0.97. This study proposes a method for finding similar news articles that can be used in conjunction with other methods already in use.
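
As a rough illustration of the lexical part of the pipeline described in the abstract, the sketch below computes Jaccard and Cosine similarity between two keyphrase lists and an NDCG score for a ranked result list. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the example keyphrases, the term-frequency vectors used for Cosine similarity, and the log2-discounted NDCG formula are all assumptions chosen for illustration. The Word2Vec-based semantic similarity is not covered here, since it requires a trained embedding model.

```python
import math
from collections import Counter


def jaccard_similarity(phrases_a, phrases_b):
    """Jaccard similarity: shared keyphrases divided by all distinct keyphrases."""
    set_a, set_b = set(phrases_a), set(phrases_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(phrases_a, phrases_b):
    """Cosine similarity over keyphrase count vectors (assumed representation)."""
    counts_a, counts_b = Counter(phrases_a), Counter(phrases_b)
    dot = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ndcg(relevances):
    """NDCG for relevance grades listed in ranked order (assumed log2 discount)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical keyphrases for two news articles (not taken from the paper).
article_1 = ["flood warning", "heavy rain", "east coast"]
article_2 = ["heavy rain", "east coast", "school closure"]

print(jaccard_similarity(article_1, article_2))
print(cosine_similarity(article_1, article_2))
print(ndcg([3, 2, 3, 0, 1]))
```

In a setup like this, each candidate article would be scored against the query article, the candidates ranked by score, and the ranking judged with NDCG against human relevance grades.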

Funder

University Malaysia Pahang (UMP) Flagship

Publisher

PeerJ

Subject

General Computer Science

Link

https://peerj.com/articles/cs-1024.pdf

Reference 53 articles.

1. Academics’ views on the characteristics of academic writing;Akkaya;Educational Policy Analysis and Strategic Research,2018

2. Query expansion techniques for information retrieval: a survey;Azad;Information Processing & Management,2019

3. Improving performance of text summarization;Babar;Procedia Computer Science,2015

4. An efficient recommendation generation using relevant Jaccard similarity;Bag;Information Sciences,2019

5. Research paper recommender system evaluation: a quantitative literature survey;Beel,2013

Cited by 11 articles.

1. Flexible margins and multiple samples learning to enhance lexical semantic similarity;Engineering Applications of Artificial Intelligence;2024-07

2. MULTI-LEVEL CITY PORTRAIT RESEARCH BASED ON MULTI-SOURCE DATA;ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences;2023-12-05

3. Characteristics and evolution of hierarchical fishery policies in China – A textual analysis based on 5311 policies from 2003 to 2022;Marine Policy;2023-09

4. Comparing Manually Added Research Labels and Automatically Extracted Research Keywords to Identify Specialist Researchers in Learning Analytics: A Case Study Using Google Scholar Researcher Profiles;Applied Sciences;2023-06-15

5. Finding Patient Zero and Tracking Narrative Changes in the Context of Online Disinformation Using Semantic Similarity Analysis;Mathematics;2023-04-26