Affiliation:
1. Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed) University, Pune, India
Abstract
News is now generated continuously and at high speed, and its volume keeps growing across web sources such as talent hunts, news agencies, and so on. To predict the class of a news item from its topic, GepH (Grouped entity predictor for Hindi) is proposed, based on entity extraction and grouping. Entity extraction is well established for English corpora. Hindi is a national language, but because of its resource scarcity it has not been explored much by researchers. More than 1,270 news articles are processed to apply entity extraction, clustering, and classification using the vector space model for Hindi (VSMH), the synset vector space model for Hindi (SVSMH), and the grouped entity document matrix for Hindi (GEDMH). Synset-based dimension reduction techniques are used to improve accuracy. Evaluation of hierarchical agglomerative clustering (HAC) on the three matrices shows that GEDMH performs best across varied datasets. The labelled corpus obtained by applying HAC to GEDMH is then used as a training dataset, and predictions are made using random forest and Naïve Bayes. The Naïve Bayes classifier implemented using the proposed GEDMH performs best. GepH achieves a purity of 0.8, an entropy of 0.4, and an error rate of 0.3 on 1,273 Hindi news articles.
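The abstract reports cluster quality as purity and entropy without stating the formulas used. As a minimal sketch, assuming the standard definitions (purity as the fraction of items that fall in the majority class of their cluster, entropy as the size-weighted average of per-cluster class entropy in bits), the two measures could be computed as follows. The function name `purity_and_entropy` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from math import log2


def purity_and_entropy(cluster_ids, true_labels):
    """Return (purity, entropy) of a clustering against reference labels.

    Assumes the standard textbook definitions; the paper may normalise
    entropy differently (for example, by the log of the class count).
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("at least one item is required")

    # Count how many items of each reference class land in each cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster][label] += 1

    # Purity: items in the majority class of their cluster, over all items.
    purity = sum(max(counts.values()) for counts in members.values()) / n

    # Entropy: per-cluster class entropy, weighted by cluster size.
    entropy = 0.0
    for counts in members.values():
        size = sum(counts.values())
        h = -sum((k / size) * log2(k / size) for k in counts.values())
        entropy += (size / n) * h

    return purity, entropy


# Toy example (hypothetical data): six articles in two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sports", "sports", "politics", "politics", "politics", "politics"]
purity, entropy = purity_and_entropy(clusters, labels)
print(round(purity, 3), round(entropy, 3))
```

Higher purity and lower entropy both indicate clusters that align more closely with the reference classes, which is the direction in which the abstract's comparison of the three matrices is read.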
Publisher
World Scientific Pub Co Pte Ltd
Subject
Library and Information Sciences, Computer Networks and Communications, Computer Science Applications