On classification of abstracts obtained from medical journals

Author:

Parlak Bekir¹, Uysal Alper Kürşat¹

Affiliation:

1. Department of Computer Engineering, Faculty of Engineering, Anadolu University, Turkey

Abstract

Classification of medical documents has mostly been carried out on English data sets, and these studies were performed on hospital records rather than academic texts. The main reasons behind this situation are the lack of publicly available data sets and the cost and time required to build them. As the first contribution of this study, two data sets comprising the Turkish and English counterparts of the same abstracts published in Turkish medical journals were constructed. Turkish is a widely used agglutinative language, and English is a good example of a non-agglutinative language. While the English abstracts were obtained automatically from the MEDLINE database with a computer program, the Turkish counterparts of these documents were collected manually from the Internet. As the second contribution of this study, an extensive comparison of the classification of abstracts obtained from Turkish medical journals was made using these two equivalent data sets. Features were extracted from the text documents with three different approaches: unigram, bigram and hybrid. The hybrid approach combines unigram and bigram features. In the experiments, three different feature selection methods and seven different classifiers were utilised. According to the results on both data sets, classification performance on the English abstracts was higher than on their Turkish counterparts. Maximum accuracies were obtained from the combination of unigram features, the distinguishing feature selector (DFS) and the multinomial naïve Bayes (MNB) classifier for both data sets. Unigram features were generally more efficient than bigram and hybrid features. However, analysis of the top-10 features indicated that nearly half of the features were translations of each other across the Turkish and English data sets.
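The three feature extraction approaches and the DFS ranking step named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the whitespace tokenizer, the function names (extract_features, dfs_scores, top_k) and the toy documents are hypothetical and are not taken from the study. The abstract does not give the DFS formula; the score below follows the form in which DFS is commonly stated, a sum over classes of P(C|t) / (P(not t|C) + P(t|not C) + 1) computed from document-level presence. The MNB classifier is not shown.

```python
from collections import Counter


def extract_features(text, mode="unigram"):
    """Return unigram, bigram, or hybrid (unigram + bigram) features of a text.

    Assumption: lowercasing and whitespace splitting stand in for real tokenization.
    """
    tokens = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    if mode == "unigram":
        return list(tokens)
    if mode == "bigram":
        return bigrams
    if mode == "hybrid":
        return list(tokens) + bigrams
    raise ValueError(f"unknown mode: {mode}")


def dfs_scores(docs, labels, mode="unigram"):
    """Score every feature with a DFS-style formula (assumed form, see lead-in)."""
    classes = sorted(set(labels))
    n_docs = len(docs)
    class_sizes = Counter(labels)

    # doc_freq[term][cls] = number of documents of class cls that contain term
    doc_freq = {}
    for text, label in zip(docs, labels):
        for term in set(extract_features(text, mode)):
            doc_freq.setdefault(term, Counter())[label] += 1

    scores = {}
    for term, per_class in doc_freq.items():
        total = sum(per_class.values())  # documents containing the term
        score = 0.0
        for cls in classes:
            in_cls = per_class[cls]
            p_cls_given_t = in_cls / total
            p_not_t_given_cls = 1 - in_cls / class_sizes[cls]
            n_other = n_docs - class_sizes[cls]
            p_t_given_not_cls = (total - in_cls) / n_other if n_other else 0.0
            score += p_cls_given_t / (p_not_t_given_cls + p_t_given_not_cls + 1)
        scores[term] = score
    return scores


def top_k(scores, k):
    """Return the k highest-scoring features, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    # Toy documents invented for this sketch; not drawn from the paper's data sets.
    docs = [
        "heart attack risk",
        "heart surgery outcome",
        "skin rash treatment",
        "skin cancer treatment",
    ]
    labels = ["cardio", "cardio", "derm", "derm"]
    print(top_k(dfs_scores(docs, labels, mode="unigram"), 3))
```

Under this formula a feature that occurs in every document of exactly one class scores highest, while a feature spread evenly over all classes scores lowest. Passing mode="bigram" or mode="hybrid" ranks the other two feature sets in the same way.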

Funder

Anadolu Üniversitesi

Publisher

SAGE Publications

Subject

Library and Information Sciences, Information Systems

Cited by 19 articles.

1. A RULE-BASED APPROACH USING THE ROUGH SET ON COVID-19 DATA;Eskişehir Osmangazi Üniversitesi Mühendislik ve Mimarlık Fakültesi Dergisi;2024-08-12

2. An improved data augmentation approach and its application in medical named entity recognition;BMC Medical Informatics and Decision Making;2024-08-05

3. Öznitelik Seçimi ile Desteklenen Makine Öğrenmesine Dayalı Göğüs Kanserinin Erken Tespiti ve Teşhisi;Gazi Üniversitesi Fen Bilimleri Dergisi Part C: Tasarım ve Teknoloji;2024-06-29

4. Feature selection SVM through Universum and its applications on text classification;Proceedings of the 2nd International Conference on Signal Processing, Computer Networks and Communications;2023-12-08

5. A novel feature ranking algorithm for text classification: Brilliant probabilistic feature selector (BPFS);Computational Intelligence;2023-08-18
