Statistical Evaluation of Categorical Encoders for Pattern Preservation in Machine Learning Tasks
-
Published:2024-06-12
Issue:2
Volume:15
Page:160-172
-
ISSN:2007-1558
-
Container-title:International Journal of Combinatorial Optimization Problems and Informatics
-
Short-container-title:Int. Journal of COP and Infor.
Author:
Valdez-Valenzuela Eric, Kuri-Morales Angel, Gomez-Adorno Helena
Abstract
Categorical attributes are prevalent in many datasets used for training machine learning (ML) models. However, most ML models are designed to handle only numerical inputs, so these categorical attributes must be converted into numerical values before they can be used effectively. It is essential that this conversion preserve the underlying patterns, since a loss of such information could adversely affect the performance of ML algorithms. Several encoding techniques have been developed to map categorical instances to numbers. This study evaluates commonly used encoders alongside CESAMO, a novel encoder designed to capture relationships between categorical attributes and other variables using what are referred to as Pattern Preserving Codes. We conducted a statistically supported assessment of these categorical encoders using synthetic data and compared their performance. The results show that CESAMO outperforms all other evaluated encoding techniques, confirming its ability to identify patterns in categorical data effectively.
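As a rough illustration of what mapping categorical instances to numbers means, the following minimal sketch implements two widely used baseline encoders, ordinal and one-hot, in plain Python. It is not CESAMO and is not taken from the paper; the function names and the toy `colors` attribute are assumptions made only for this example.

```python
# Illustrative sketch only: two common baseline encoders, NOT the CESAMO
# method evaluated in the paper. All names below are hypothetical.

def ordinal_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes


def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows, categories


colors = ["red", "green", "blue", "green"]  # toy categorical attribute (assumed)
ordinal, mapping = ordinal_encode(colors)
one_hot, columns = one_hot_encode(colors)

print(ordinal)   # [0, 1, 2, 1]
print(columns)   # ['blue', 'green', 'red']
print(one_hot)   # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

In this sketch the ordinal codes impose an order and a distance (for instance, `blue` = 2 sits "further" from `red` = 0 than `green` = 1 does) that the original labels do not carry, while the one-hot vectors avoid that at the cost of one column per category. This is one simple way to see why the choice of encoder can add or lose patterns in the data.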
Publisher
Editorial Académica Dragón Azteca