Affiliation:
1. Thiagarajar College of Engineering, India
2. KLN College of Information Technology, India
Abstract
This article describes how semantic annotation is the most important need for the categorization of labeled or unlabeled textual documents. Accuracy of document categorization can be greatly improved if documents are indexed or modeled using the semantics rather than the traditional term-frequency model. This annotation has its own challenges like synonymy and polysemy in the document categorization problem. The model proposes to build domain ontology for the textual content so that the problems like synonymy and polysemy in text analysis are resolved to greater extent. Latent Dirichlet Allocation (LDA), the topic modeling technique has been used for feature extraction from the documents. Using the domain knowledge on the concept and the features grouped by LDA, the domain ontology is built in the hierarchical fashion. Empirical results show that LDA is the better feature extraction technique for text documents than TF or TF-IDF indexing technique. Also, the proposed model shows improvement in the accuracy of document categorization when domain ontology built using LDA has been used for document indexing.
Cited by
2 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献
1. An Opinion Mining Approach for Drug Reviews in Spanish;Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines;2022-06-10
2. An Opinion Mining Approach for Drug Reviews in Spanish;Handbook of Research on Natural Language Processing and Smart Service Systems;2021