Abstract
Keyphrases are the most important phrases of a document, which makes them useful for improving natural language processing tasks such as information retrieval, document classification, document visualization, summarization and categorization. Here, we propose a supervised framework augmented by novel extra-textual information derived primarily from Wikipedia. Unlike most other methods that rely on Wikipedia, our approach does not require a full textual index of all Wikipedia articles; it exploits only the category hierarchy and a list of multiword expressions derived from Wikipedia. This approach is not only less resource intensive, but also produces results comparable or superior to those of previous similar works. Our thorough evaluations also suggest that the proposed framework performs consistently well on multiple datasets, matching or even outperforming other state-of-the-art methods. Besides introducing features that incorporate extra-textual information, we also experimented with a novel way of representing features derived from the POS tagging of the keyphrase candidates.
Publisher
Cambridge University Press (CUP)
Subject
Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software
Cited by
7 articles.