Abstract
An information retrieval system contains a list of subject terms (a taxonomy), a list of collaborative tags (a folksonomy), or both. Used together, a taxonomy and a folksonomy are called hybrid subject devices. The main purpose of this paper is to apply machine learning (ML) techniques to a dataset from the library domain, as is done in other domains, and to analyse a large quantity of data accurately for critical problems. The research performs exploratory data analysis (EDA), prediction analysis, and similarity measurement between folksonomy and taxonomy terms. Data science deals with big data, that is, unstructured and messy data in large volumes, with sizes measured in gigabytes or terabytes. Excel and other spreadsheet packages usually cannot manage files of that size, which is why ML tools and techniques are applied. The library science domain now also holds large amounts of data, such as 20 to 30 years of circulation data, subject descriptors, and collaborative tags, and library professionals can apply ML tools to analyse it. In this paper, the authors introduce the application of these tools and techniques in the library domain and test them on a dataset of 2642 taxonomy and folksonomy terms. The EDA includes the frequency of LCSH (Library of Congress Subject Headings, the taxonomy) terms, pair plots, joint plots, and a heat map of LCSH and folksonomy terms. A logistic regression (LR) model is used for prediction analysis on the folksonomy and taxonomy dataset.
The EDA has been performed on the attributes in the dataset. The logistic regression model attains an F1-score of 0.37 at a training percentage of 69. The similarity between LCSH terms and folksonomy terms is 30 per cent (0.30151134), and the angle between the two vectors is 27 degrees. The novelty of this research work is that library data has been analysed using machine learning techniques not used on it before.
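The similarity measurement between two term lists can be illustrated with a minimal sketch. It assumes the similarity is a cosine similarity between term-frequency vectors, which the abstract does not state. The function names and the toy term lists are hypothetical and are not taken from the authors' dataset.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two term lists, using raw term counts.

    Assumption: each list is turned into a term-frequency vector over the
    combined vocabulary. The paper's own vectorisation may differ.
    """
    counts_a = Counter(terms_a)
    counts_b = Counter(terms_b)
    vocabulary = set(counts_a) | set(counts_b)
    # Counter returns 0 for missing terms, so unshared terms add nothing.
    dot = sum(counts_a[t] * counts_b[t] for t in vocabulary)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty list has no direction to compare
    return dot / (norm_a * norm_b)


def angle_degrees(similarity):
    """Angle between the two vectors, in degrees, from their cosine."""
    clamped = max(-1.0, min(1.0, similarity))  # guard against rounding drift
    return math.degrees(math.acos(clamped))


# Hypothetical toy data standing in for taxonomy and folksonomy terms.
lcsh_terms = ["history", "fiction", "science", "history"]
folksonomy_terms = ["fiction", "fantasy", "history"]

similarity = cosine_similarity(lcsh_terms, folksonomy_terms)
angle = angle_degrees(similarity)
print(f"similarity = {similarity:.4f}, angle = {angle:.2f} degrees")
```

Under this assumption the angle is the arccosine of the similarity, so a higher similarity corresponds to a smaller angle between the two term vectors.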
Publisher
Defence Scientific Information and Documentation Centre
Subject
Library and Information Sciences