TTC-3600: A new benchmark dataset for Turkish text categorization

Published:2015-12-01 Issue:2 Volume:43 Page:174-185
ISSN:0165-5515
Container-title:Journal of Information Science
language:en
Short-container-title:Journal of Information Science

Author:

Kılınç Deniz¹, Özçift Akın¹, Bozyigit Fatma¹, Yıldırım Pelin¹, Yücalar Fatih¹, Borandag Emin¹

Affiliation:

1. Faculty of Technology, Celal Bayar University, Turkey

Abstract

Owing to the rapid growth of the World Wide Web, the number of documents accessible via the Internet increases explosively with each passing day. On news portals in particular, documents belonging to categories such as technology, sports and politics sometimes appear in the wrong category or are placed in a generic category called "others". At this point, text categorization (TC), which is generally addressed as a supervised learning task, is needed. Although a substantial number of studies have been conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the limited accessibility and usability of the datasets created so far. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy, 91.03%, is obtained with the combination of the Random Forest classifier and the attribute ranking-based feature selection method, across all comparisons performed after the pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.

Publisher

SAGE Publications

Subject

Library and Information Sciences, Information Systems

Link

http://journals.sagepub.com/doi/pdf/10.1177/0165551515620551

References: 42 articles.

1. The contribution of data mining to information science

2. A Systematic Comparison of Supervised Classifiers

3. Machine learning in automated text categorization

4. Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization

Cited by 54 articles.

1. Relational Turkish Text Classification Using Distant Supervised Entities and Relations;Computers, Materials & Continua;2024

2. Feature selection based on long short term memory for text classification;Multimedia Tools and Applications;2023-10-18

3. Strategies for enhancing the performance of news article classification in Bangla: Handling imbalance and interpretation;Engineering Applications of Artificial Intelligence;2023-10

4. A Turkish Text Classification Based Feature Selection and Density Peaks Clustering;2023 31st Signal Processing and Communications Applications Conference (SIU);2023-07-05

5. Filter feature selection methods for text classification: a review;Multimedia Tools and Applications;2023-05-11