A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems

Published: 2001-07; Issue: 1; Volume: 3; Pages: 27-32
ISSN: 1931-0145
Container-title: ACM SIGKDD Explorations Newsletter
Language: en
Short-container-title: SIGKDD Explor. Newsl.

Author:

Micci-Barreca Daniele¹

Affiliation:

1. ClearCommerce Corporation, Austin, TX

Abstract

Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes. This paper presents a simple preprocessing scheme for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks, linear and logistic regression. The proposed method is based on a well-established statistical method (empirical Bayes) that is straightforward to implement as an in-database procedure. Furthermore, for categorical attributes with an inherent hierarchical structure, like ZIP codes, the preprocessing scheme can directly leverage the hierarchy by blending statistics at the various levels of aggregation. While the statistical methods discussed in this paper were first introduced in the mid-1950s, the use of these methods as a preprocessing step for complex models, like neural networks, has not been previously discussed in any literature.
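The abstract gives no formulas, so the following is only a minimal sketch of the general idea it describes: replacing each category with a blend of that category's target mean and the overall target mean, trusting the category's own mean more as its sample count grows. The function name `smoothed_target_encoding`, the sigmoid weighting, and the parameters `k` and `f` are assumptions made for illustration, not details taken from this record.

```python
import math
from collections import defaultdict


def smoothed_target_encoding(categories, targets, k=5.0, f=1.0):
    """Return ({category: encoded value}, global_mean).

    Hypothetical parameters (not given in the abstract):
      k: count at which the category mean and the global mean get equal weight
      f: controls how quickly the weight shifts as the count moves away from k
    """
    if len(categories) != len(targets) or not targets:
        raise ValueError("categories and targets must be non-empty and equal in length")

    # Prior estimate: the target mean over the whole training set.
    global_mean = sum(targets) / len(targets)

    # Per-category counts and target sums.
    counts = defaultdict(int)
    sums = defaultdict(float)
    for category, y in zip(categories, targets):
        counts[category] += 1
        sums[category] += y

    encoding = {}
    for category, n in counts.items():
        # Weight in (0, 1) that grows with the number of observations.
        weight = 1.0 / (1.0 + math.exp(-(n - k) / f))
        category_mean = sums[category] / n
        encoding[category] = weight * category_mean + (1.0 - weight) * global_mean
    return encoding, global_mean


# Toy data: one frequent ZIP code and one rare ZIP code.
zips = ["78701"] * 20 + ["78702"] * 2
labels = [1] * 10 + [0] * 10 + [1, 1]
encoding, prior = smoothed_target_encoding(zips, labels)
```

In this toy data the frequent ZIP code keeps a value close to its own mean, while the rare one is pulled most of the way toward the global mean. A category never seen in training could reasonably fall back to `prior`. For hierarchical attributes, the same blend could in principle be applied with a coarser level's estimate (for example a ZIP prefix) in place of the global mean, though the record here does not spell out how the paper does this.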

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/507533.507538

References: 13 articles.

1. Automating exploratory data analysis for efficient data mining

2. Gnanadesikan R. Methods for Statistical Data Analysis of Multivariate Observations. Wiley, New York, 1977.

Cited by 134 articles.

1. Enabling CMF estimation in data-constrained scenarios: A semantic-encoding knowledge mining model;Accident Analysis & Prevention;2024-09

2. Mean Block Size Prediction in Rock Blast Fragmentation Using TPE-Tree-Based Model Approach with SHapley Additive exPlanations;Mining, Metallurgy & Exploration;2024-08-08

3. Predicting onward care needs at admission to reduce discharge delay using machine learning;2024-08-07

4. Classification of architectural and MEP BIM objects for building performance evaluation;Advanced Engineering Informatics;2024-08

5. Imbalanced rock burst assessment using variational autoencoder-enhanced gradient boosting algorithms and explainability;Underground Space;2024-08