Automatic Classification of Swedish Metadata Using Dewey Decimal Classification: A Comparison of Approaches

Author:

Golub Koraljka (1), Hagelbäck Johan (2), Ardö Anders (3)

Affiliation:

1. Department of Cultural Sciences, Faculty of Arts and Humanities, Linnaeus University, Växjö, Sweden

2. Department of Computer Science and Media Technology, Faculty of Technology, Linnaeus University, Kalmar, Sweden

3. Department of Electrical and Information Technology, Lund University, Lund, Sweden

Abstract

Purpose: As more digital collections of various information resources become available, the challenge of assigning subject index terms and classes from quality knowledge organization systems also grows. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC.

Design/methodology/approach: State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research comprised 143,838 records, which had to be reduced to the top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels).

Findings: Evaluation shows that a Support Vector Machine with linear kernel outperforms the other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than the Support Vector Machine but came close, with the benefit of a smaller representation size. Analysis of the impact of features in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often fewer than 100 per class) are available on which to train. These figures hold only for the top three hierarchical levels (803 instead of 14,413 classes).

Research limitations: Having to reduce the number of hierarchical levels to the top three levels of DDC, because of the lack of training data for all classes, skews the results so that they work in experimental conditions but barely for end users in operational retrieval systems.

Practical implications: In conclusion, for operative information retrieval systems, applying purely automatic DDC classification does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using a string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance the accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with the quality of human decisions at the final stage should be the way for the future.

Originality/value: The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to the lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.
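The data-reduction step described under Design/methodology/approach (collapsing records to the top three DDC levels and keeping only classes with enough training examples) might look roughly like the sketch below. It is an illustration, not the authors' code: the function names `truncate_ddc` and `reduce_records` are hypothetical, and it assumes each record is a `(text, ddc_number)` pair whose DDC number is zero-padded so that the first three digits identify the third hierarchical level.

```python
from collections import Counter


def truncate_ddc(ddc_number: str) -> str:
    """Map a full DDC number to its three-digit class.

    Assumption: the number is zero-padded (e.g. "004.6"), so the three
    digits before the decimal point correspond to the top three levels.
    """
    return ddc_number.split(".")[0][:3]


def reduce_records(records, min_per_class):
    """Truncate labels to three digits and drop under-represented classes.

    `records` is an iterable of (text, ddc_number) pairs; `min_per_class`
    is a hypothetical threshold (the abstract mentions 1,000 as the number
    of examples per class that state-of-the-art algorithms require).
    """
    truncated = [(text, truncate_ddc(ddc)) for text, ddc in records]
    counts = Counter(label for _, label in truncated)
    return [(text, label) for text, label in truncated
            if counts[label] >= min_per_class]


if __name__ == "__main__":
    sample = [
        ("Svensk roman", "839.738"),
        ("Svensk novell", "839.7"),
        ("Datornätverk", "004.6"),
    ]
    # With a threshold of 2, only the class shared by two records survives.
    print(reduce_records(sample, min_per_class=2))
```

Truncating before counting matters here: records that are too sparse at their full DDC depth can still contribute to a three-digit class once their labels are merged.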

Publisher

Walter de Gruyter GmbH

Cited by 10 articles.

1. B-Wheel – Building AI competences in academic libraries; The Journal of Academic Librarianship; 2024-07

2. Indonesian Dewey Decimal Classification System Using Support Vector Machine and Neural Network Algorithms; 2023 IEEE 7th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE); 2023-11-29

3. Is adopting artificial intelligence in libraries urgency or a buzzword? A systematic literature review; Journal of Information Science; 2023-01-12

4. Design of Semiautomatic Digital Creation System for Electronic Music Based on Recurrent Neural Network; Computational Intelligence and Neuroscience; 2022-06-27

5. Leaders, practitioners and scientists' awareness of artificial intelligence in libraries: a pilot study; Library Hi Tech; 2022-04-04
