Author:
Fudholi Dhomas Hatta, Juwairi Kiki Purnama
Abstract
The medical domain is perennially important because good health is a universal goal. People look for medical documents in large pools of data and information such as the web. To support information retrieval and knowledge dissemination through the web, we analyze the use of semi-supervised learning to classify medical-related documents. We choose semi-supervised learning to show that good classifiers can be built with limited human supervision. In this research, we combine Naïve Bayes with the pseudo-labeling technique. To observe semi-supervised performance under different levels of human supervision, we vary the labeled:unlabeled ratio of the training dataset across 4:3, 3:4, 2:5, and 1:6. Average classification accuracy is similar across the ratios (81%-83%). Interestingly, in one experiment the highest accuracy at the 1:6 ratio (85%) exceeds that of the 2:5 ratio (82%) and equals that of the 4:3 ratio (85%). However, the 1:6 ratio also has the highest standard deviation of accuracy among the ratios (4.183). We conclude that semi-supervised learning can be used to create an effective classifier for medical-domain documents in Bahasa Indonesia with less human supervision.
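The abstract names Naïve Bayes with pseudo-labeling but does not give implementation details, so the following is only a minimal sketch of the general technique, not the authors' implementation. It assumes a multinomial Naïve Bayes with Laplace smoothing, a confidence threshold of 0.6, and a tiny toy corpus of English tokens; the function names (`train_nb`, `predict_nb`, `pseudo_label`), the threshold, and the data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect the counts needed for multinomial Naive Bayes."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return (best_label, posterior_probability) for one tokenized document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    # Turn log scores into normalized posteriors in a numerically stable way.
    top = max(log_scores.values())
    exp_scores = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    norm = sum(exp_scores.values())
    best = max(exp_scores, key=exp_scores.get)
    return best, exp_scores[best] / norm


def pseudo_label(labeled_docs, labels, unlabeled_docs, threshold=0.6, max_rounds=5):
    """Self-training loop: adopt confident predictions as labels, then retrain.

    The threshold and round limit are illustrative assumptions.
    """
    docs, labs = list(labeled_docs), list(labels)
    pool = list(unlabeled_docs)
    model = train_nb(docs, labs)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for tokens in pool:
            label, confidence = predict_nb(model, tokens)
            if confidence >= threshold:
                docs.append(tokens)
                labs.append(label)
                added += 1
            else:
                remaining.append(tokens)
        pool = remaining
        if added == 0:  # nothing new was confident enough; stop early
            break
        model = train_nb(docs, labs)
    return model


# Toy data (hypothetical): two labeled and two unlabeled documents.
labeled = [["fever", "cough", "doctor"], ["football", "goal", "match"]]
labels = ["medical", "other"]
unlabeled = [["cough", "fever", "hospital"], ["match", "goal", "stadium"]]

model = pseudo_label(labeled, labels, unlabeled)
print(predict_nb(model, ["hospital", "fever"]))
print(predict_nb(model, ["stadium", "goal"]))
```

In this sketch the word "hospital" appears only in an unlabeled document, so the final model can use it only because that document was pseudo-labeled and folded back into training. Changing how many documents go into `labeled` versus `unlabeled` mimics the labeled:unlabeled ratios described above.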