A Text Mining Approach to Uncover the Structure of Subject Metadata in the Biodiversity Heritage Library

Published: 2023-10 Issue: 1 Volume: 60 Page: 926-928
ISSN: 2373-9231
Container-title: Proceedings of the Association for Information Science and Technology
Language: en

Author:

Cheng Yi‐Yun¹, Parulian Nikolaus Nova², Dinh Ly³

Affiliation:

1. School of Communication and Information, Rutgers University, USA

2. School of Information Sciences, University of Illinois Urbana‐Champaign, USA

3. School of Information, University of South Florida, USA

Abstract

We propose a bottom‐up, data‐driven pipeline to uncover the structure of biodiversity subject metadata using a combination of text mining approaches. In this study, we analyze 721,035 subject terms in the Biodiversity Heritage Library (BHL). We use named entity recognition and word‐embedding methods to systematically label and group terms based on their vector‐space distances. The results show that the subject terms from BHL cluster into several prominent themes relating to environmental regulations, geographic locations, organisms, and subject access points. We hope that our approach can serve as a first step toward grouping similar subject terms in large‐scale, constantly growing digital collections with metadata aggregated from multiple sources. Ultimately, we hope the next phases of this project can become a basis for biodiversity digital libraries to standardize their vocabularies.
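The grouping step described in the abstract, placing terms that lie close together in a vector space into the same group, can be illustrated with a minimal sketch. This is not the authors' pipeline: it omits the named entity recognition step, and it substitutes character trigram counts for trained word embeddings so that the example runs with the standard library alone. The function names (`embed`, `cosine`, `group_terms`), the sample terms, and the similarity threshold of 0.5 are all assumptions made for illustration.

```python
import math
from collections import Counter


def embed(term, n=3):
    """Stand-in for a word embedding: counts of character n-grams.

    Assumption: a real pipeline would use trained embeddings instead.
    """
    text = f" {term.lower()} "  # pad so word boundaries form n-grams too
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_terms(terms, threshold=0.5):
    """Greedy single-pass grouping.

    Each term joins the first group whose seed (first) term is at least
    `threshold` similar; otherwise it starts a new group.
    """
    groups = []  # list of (seed_vector, member_terms)
    for term in terms:
        vec = embed(term)
        for seed, members in groups:
            if cosine(vec, seed) >= threshold:
                members.append(term)
                break
        else:
            groups.append((vec, [term]))
    return [members for _, members in groups]


if __name__ == "__main__":
    # Hypothetical subject terms, not taken from the BHL data set.
    sample = ["Botany", "botany.", "Birds", "Birds -- Brazil", "Zoology"]
    for group in group_terms(sample):
        print(group)
```

Greedy grouping against a seed term is order-dependent and only a rough proxy for clustering; it is used here because it keeps the distance-based idea visible in a few lines.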

Publisher

Wiley

Subject

Library and Information Sciences, General Computer Science
