Named Entities as Key Features for Detecting Semantically Similar News Articles

Published: 2023-07-29 Issue: 04 Volume: 17 Page: 633-649
ISSN: 1793-351X
Container-title: International Journal of Semantic Computing
Language: en
Short-container-title: Int. J. Semantic Computing

Author:

Anne Stockem Novo¹, Fatih Gedikli¹

Affiliation:

1. Institute of Computer Science, Ruhr West University of Applied Sciences, Duisburger Straße 100, 45479 Mülheim an der Ruhr, Germany

Abstract

This work focuses on detecting semantically similar news articles for search engines and recommender systems, an important step towards processing and understanding natural language. Search engines and recommender systems typically filter out near-duplicate articles, which are often just a paraphrase of a previous article and therefore irrelevant to users. Articles with a high level of overlapping content are of little interest to the reader and should be avoided. Here, we focus on named entities, such as people, organizations and places, and their role as a key feature for identifying near-duplicate articles. Since our dataset from the energy business contains a significant number of paraphrased articles, standard techniques, e.g. those based on the Jaccard coefficient, already perform quite well. A fine-tuned BERT model evaluated on named entities achieves the best results, with more than 97% accuracy and the highest true positive rates. The importance of individual words for the model's decisions is evaluated by computing their Shapley values. The explanations were found to be in good overall agreement with intuitive human interpretation.
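
To make the Jaccard-based baseline mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the named entities have already been extracted by some NER step that is not shown; the function names, the 0.5 threshold and the entity lists are hypothetical and chosen only for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets count as dissimilar.
        return 0.0
    return len(set_a & set_b) / len(union)


def is_near_duplicate(entities_a, entities_b, threshold=0.5):
    """Flag two articles as near-duplicates if their entity sets overlap enough.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return jaccard(entities_a, entities_b) >= threshold


# Hypothetical, hand-written entity lists standing in for NER output.
article_1 = ["Acme Energy", "Berlin", "Jane Doe", "Germany"]
article_2 = ["Acme Energy", "Berlin", "Jane Doe", "Example Press"]
article_3 = ["Nordic Power", "Oslo", "Norway"]

print(jaccard(article_1, article_2))            # 3 shared / 5 total entities
print(is_near_duplicate(article_1, article_2))
print(is_near_duplicate(article_1, article_3))
```

Comparing sets of entities rather than all tokens ignores paraphrased wording around the same people, organizations and places, which is the intuition the abstract gives for treating named entities as a key feature.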

Publisher

World Scientific Pub Co Pte Ltd

Subject

Artificial Intelligence, Computer Networks and Communications, Computer Science Applications, Linguistics and Language, Information Systems, Software

Link

https://www.worldscientific.com/doi/pdf/10.1142/S1793351X23300030
