Exploiting domain knowledge to address class imbalance and a heterogeneous feature space in multi-class classification
Published:2023-02-27
Issue:5
Volume:32
Page:1037-1064
ISSN:1066-8888
-
Container-title:The VLDB Journal
-
language:en
-
Short-container-title:The VLDB Journal
Author:
Hirsch Vitali,Reimann Peter,Treder-Tschechlov Dennis,Schwarz Holger,Mitschang Bernhard
Abstract
Real-world data of multi-class classification tasks often show complex data characteristics that lead to a reduced classification performance. Major analytical challenges are a high degree of multi-class imbalance within data and a heterogeneous feature space, which increases the number and complexity of class patterns. Existing solutions to classification or data pre-processing only address one of these two challenges in isolation. We propose a novel classification approach that explicitly addresses both challenges of multi-class imbalance and heterogeneous feature space together. As main contribution, this approach exploits domain knowledge in terms of a taxonomy to systematically prepare the training data. Based on an experimental evaluation on both real-world data and several synthetically generated data sets, we show that our approach outperforms any other classification technique in terms of accuracy. Furthermore, it entails considerable practical benefits in real-world use cases, e.g., it reduces rework required in the area of product quality control.
Funder
Deutsche Forschungsgemeinschaft
Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg
Publisher
Springer Science and Business Media LLC
Subject
Hardware and Architecture,Information Systems
Cited by
2 articles.