Hierarchy-based semantic embeddings for single-valued & multi-valued categorical variables

Published:2021-12-28
ISSN:0925-9902
Container-title:Journal of Intelligent Information Systems
language:en
Short-container-title:J Intell Inf Syst

Author:

Mumtaz Summaya, Giese Martin

Abstract

In low-resource domains, it is challenging to achieve good performance using existing machine learning methods due to a lack of training data and mixed data types (numeric and categorical). In particular, categorical variables with high cardinality pose a challenge to machine learning tasks such as classification and regression because training requires sufficiently many data points for the possible values of each variable. Since interpolation is not possible, nothing can be learned for values not seen in the training set. This paper presents a method that uses prior knowledge of the application domain to support machine learning in cases with insufficient data. We propose to address this challenge by using embeddings for categorical variables that are based on an explicit representation of domain knowledge (KR), namely a hierarchy of concepts. Our approach is to (1) define a semantic similarity measure between categories based on the hierarchy (we propose a purely hierarchy-based measure, but other similarity measures from the literature can be used) and (2) use that similarity measure to define a modified one-hot encoding. We propose two embedding schemes for single-valued and multi-valued categorical data. We perform experiments on three different use cases. We first compare existing similarity approaches with our approach on a word pair similarity use case. This is followed by creating word embeddings using different similarity approaches. A comparison with existing methods such as Google, Word2Vec and GloVe embeddings on several benchmarks shows better performance on concept categorisation tasks when using knowledge-based embeddings. The third use case uses a medical dataset to compare the performance of semantic-based embeddings and standard binary encodings. Significant improvement in performance of the downstream classification tasks is achieved by using semantic information.
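The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the paper's own hierarchy-based measure, so the sketch substitutes Wu-Palmer similarity as a stand-in. The toy hierarchy, the category names, and the element-wise maximum used for multi-valued variables are all assumptions made for illustration.

```python
# Hypothetical toy hierarchy (child -> parent); not taken from the paper.
PARENT = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "flu": "infection",
    "pneumonia": "infection",
    "melanoma": "cancer",
}
# Values the categorical variable can take (one embedding dimension each).
CATEGORIES = ["flu", "pneumonia", "melanoma"]


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity (stand-in measure): 2*depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # First node on b's path to the root that is also an ancestor of a.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def embed_single(value):
    """Modified one-hot: similarity of `value` to every category instead of 0/1."""
    return [wu_palmer(value, c) for c in CATEGORIES]


def embed_multi(values):
    """Assumed multi-valued scheme: element-wise maximum over the single-valued vectors."""
    vectors = [embed_single(v) for v in values]
    return [max(column) for column in zip(*vectors)]


if __name__ == "__main__":
    print(embed_single("flu"))
    print(embed_multi(["flu", "melanoma"]))
```

In this sketch the position of the value itself always gets 1.0, as in a standard one-hot vector. The other positions carry graded similarity, so a category with few training examples can share information with its neighbours in the hierarchy.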

Funder

Norges Forskningsråd

University of Oslo

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence, Computer Networks and Communications, Hardware and Architecture, Information Systems, Software

Link

https://link.springer.com/content/pdf/10.1007/s10844-021-00693-2.pdf

Reference 58 articles.

1. Ahmad, A., & Dey, L. (2007). A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recognition Letters, 28, 110–118.

2. Almuhareb, A. (2006). Attributes in lexical acquisition. Ph.D. thesis, University of Essex.

3. Baroni, M., & Lenci, A. (2011). How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 workshop on GEometrical models of natural language semantics (pp. 1–10). Association for Computational Linguistics.

4. Baroni, M., Murphy, B., Barbu, E., & Poesio, M. (2010). Strudel: A corpus-based semantic model based on properties and types. Cognitive Science, 34(2), 222–254. https://doi.org/10.1111/j.1551-6709.2009.01068.x.

5. Bazan, J.G. (2008). Hierarchical classifiers for complex spatio-temporal concepts. In Transactions on Rough Sets IX (pp. 474–750). Berlin: Springer. https://doi.org/10.1007/978-3-540-89876-4_26.

Cited by 9 articles.

1. Nonconvex fusion penalties for high-dimensional hierarchical categorical variables;Information Sciences;2024-10

2. Global prevalence of microplastics in tap water systems: Abundance, characteristics, drivers and knowledge gaps;Science of The Total Environment;2024-06

3. Time Is Ripe for Targeting Per- and Polyfluoroalkyl Substances-Induced Hormesis: Global Aquatic Hotspots and Implications for Ecological Risk Assessment;Environmental Science & Technology;2024-05-06

4. Road Crash Injury Severity Prediction Using a Graph Neural Network Framework;IEEE Access;2024

5. Evaluation of Synthetic Categorical Data Generation Techniques for Predicting Cardiovascular Diseases and Post-Hoc Interpretability of the Risk Factors;Applied Sciences;2023-03-23