Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

Published: 2022-03-04 Issue: 5 Volume: 37 Page: 2671-2692
ISSN: 0943-4062
Container-title: Computational Statistics
Language: en
Short-container-title: Comput Stat

Author:

Pargent Florian, Pfisterer Florian, Thomas Janek, Bischl Bernd

Abstract

Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is posed by high-cardinality features, i.e. unordered categorical predictor variables with a large number of levels. We study techniques that yield numeric representations of categorical variables, which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm’s predictive performance and, where possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary classification, and multiclass classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditional, widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or that reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
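
To make the idea of regularized target encoding concrete, the following is a minimal sketch for a regression target. It is not the encoder evaluated in the paper; it assumes a simpler, commonly used variant that combines additive smoothing toward the global mean with out-of-fold encoding. The function names, the `smoothing` parameter, and the interleaved fold assignment are all illustrative assumptions.

```python
from collections import defaultdict


def fit_smoothed_means(levels, targets, smoothing=10.0):
    """Return (per-level shrunken target mean, global mean).

    Assumption: additive smoothing, where `smoothing` acts like a number of
    pseudo-observations located at the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    # Rare levels are pulled more strongly toward the global mean.
    encoding = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return encoding, global_mean


def out_of_fold_encode(levels, targets, n_folds=5, smoothing=10.0):
    """Encode each training row using target statistics from the other folds only."""
    n = len(levels)
    if n_folds < 2 or n < n_folds:
        raise ValueError("need at least 2 folds and at least one row per fold")
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Hypothetical fold assignment: rows are interleaved by index.
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        encoding, global_mean = fit_smoothed_means(
            [levels[i] for i in rest], [targets[i] for i in rest], smoothing
        )
        for i in held_out:
            # Levels unseen in the other folds fall back to their global mean.
            encoded[i] = encoding.get(levels[i], global_mean)
    return encoded


if __name__ == "__main__":
    demo_levels = ["a", "a", "b", "b", "c", "a"]
    demo_targets = [1.0, 3.0, 10.0, 12.0, 5.0, 2.0]
    print(out_of_fold_encode(demo_levels, demo_targets, n_folds=2, smoothing=1.0))
```

In this sketch the out-of-fold step keeps a row's own target value out of its encoded feature, and the smoothing term limits how far levels with few observations can move from the global mean. Both are forms of regularization, though other regularization schemes exist.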

Funder

Bundesministerium für Bildung, Wissenschaft und Kultur

Bayerisches Staatsministerium für Wirtschaft und Medien, Energie und Technologie

Publisher

Springer Science and Business Media LLC

Subject

Computational Mathematics; Statistics, Probability and Uncertainty; Statistics and Probability

Link

https://link.springer.com/content/pdf/10.1007/s00180-022-01207-6.pdf

References: 50 articles.

1. Bates D (2020) Computational methods for mixed models. Vignette for lme4. https://cran.r-project.org/web/packages/lme4/vignettes/Theory.pdf

2. Bates D, Mächler M, Bolker B, Walker S (2015) Fitting linear mixed-effects models using lme4. J Stat Softw 67:1–48. https://doi.org/10.18637/jss.v067.i01

3. Binder M (2018) mlrCPO: Composable preprocessing operators and pipelines for machine learning. R package version 0.3.4-2. https://github.com/mlr-org/mlrCPO

4. Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones ZM (2016) mlr: machine learning in R. J Mach Learn Res 17:1–5

5. Bommert A, Sun X, Bischl B, Rahnenführer J, Lang M (2020) Benchmark for filter methods for feature selection in high-dimensional classification data. Comput Stat Data Anal. https://doi.org/10.1016/j.csda.2019.106839

Cited by 51 articles.

1. Precise prediction of CO2 separation performance of metal–organic framework mixed matrix membranes based on feature selection and machine learning;Separation and Purification Technology;2024-12

2. P2P credit risk management with KG-GNN: a knowledge graph and graph neural network-based approach;Journal of the Operational Research Society;2024-09-14

3. Adoption of Machine Learning Methods for Crop Yield Prediction-based Smart Agriculture and Sustainable Growth of Crop Yield Production – Case Study in Jordan;2024-09-05

4. Using random forest to improve EMEP4PL model estimates of daily PM2.5 in Poland;Atmospheric Environment;2024-09

5. A high-throughput workflow to analyze sequence-conformation relationships and explore hydrophobic patterning in disordered peptoids;Chem;2024-09