Abstract
Identifying key concepts in unstructured data is important in many practical applications. Although many methods have been proposed for extracting primary topics, only a few works have investigated the influence of text length on the performance of keyword extraction (KE) methods. Specifically, many studies rely on abstracts and titles to extract content from papers, leaving it uncertain whether using the complete content of papers yields consistent results. In this study, we therefore employ a network-based approach to evaluate the concordance between keywords extracted from abstracts and those extracted from entire papers. Community detection methods are used to identify clusters of interconnected papers in citation networks. Salient terms are then identified within each cluster using a methodology akin to the term frequency-inverse document frequency (tf-idf) approach. Once each cluster has been assigned its distinctive set of key terms, these terms serve as candidate keywords at the paper level: the top-ranked words at the cluster level that also appear in a paper's abstract are chosen as keywords for that paper. Our findings indicate that the various community detection methods used in KE yield similar levels of accuracy. Notably, text clustering approaches outperform all citation-based methods, although all approaches yield relatively low accuracy values. We also identified a lack of concordance between keywords extracted from the abstracts and those extracted from the corresponding full text. Considering that citations and text clustering yield distinct outcomes, combining them in hybrid approaches could offer improved performance.
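The cluster-to-paper keyword step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `paper_keywords`, the exact scoring formula (relative term frequency within a cluster times the log of inverse cluster frequency), and the tie-breaking rule are all assumptions chosen for illustration, and the cluster labels are assumed to come from a community detection step run beforehand on the citation network.

```python
import math
from collections import Counter


def paper_keywords(papers, clusters, top_k=5):
    """Pick paper-level keywords from cluster-level term rankings.

    papers:   dict mapping paper id -> list of tokens (e.g. from the abstract)
    clusters: dict mapping paper id -> cluster label (assumed to be the output
              of a community detection method applied to the citation network)
    top_k:    maximum number of keywords returned per paper

    The scoring below is a hypothetical tf-idf-like variant, not the
    formula used in the paper.
    """
    # Pool the tokens of all papers in a cluster into one pseudo-document.
    cluster_tf = {}
    for pid, tokens in papers.items():
        cluster_tf.setdefault(clusters[pid], Counter()).update(tokens)

    # Count in how many clusters each term occurs.
    n_clusters = len(cluster_tf)
    cluster_df = Counter()
    for tf in cluster_tf.values():
        cluster_df.update(tf.keys())

    # Rank the terms of each cluster; terms present in every cluster score 0
    # and are dropped because they do not distinguish any cluster.
    ranked = {}
    for label, tf in cluster_tf.items():
        total = sum(tf.values())
        scores = {
            term: (count / total) * math.log(n_clusters / cluster_df[term])
            for term, count in tf.items()
        }
        positive = [term for term in scores if scores[term] > 0]
        # Sort by descending score, breaking ties alphabetically.
        ranked[label] = sorted(positive, key=lambda term: (-scores[term], term))

    # Keep the top-ranked cluster terms that also appear in the paper itself.
    result = {}
    for pid, tokens in papers.items():
        present = set(tokens)
        result[pid] = [t for t in ranked[clusters[pid]] if t in present][:top_k]
    return result
```

Passing full-text tokens instead of abstract tokens for the same papers and comparing the two outputs gives a rough way to inspect the abstract versus full-text concordance the study discusses.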
Funder
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil
CNPq foundation
Fundação de Amparo à Pesquisa do Estado de São Paulo
CNPq-Brazil
Publisher
Public Library of Science (PLoS)