Statistical approach to normalization of feature vectors and clustering of mixed datasets

Published:2012-04-18 Issue:2145 Volume:468 Page:2630-2651
ISSN:1364-5021
Container-title:Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences
language:en
Short-container-title:Proc. R. Soc. A.

Author:

Suarez-Alvarez Maria M.¹,Pham Duc-Truong²,Prostov Mikhail Y.³,Prostov Yuriy I.⁴

Affiliation:

1. School of Engineering, Cardiff University, Cardiff CF24 0AA, UK

2. School of Mechanical Engineering, University of Birmingham, Birmingham B15 2TT, UK

3. Faculty of Mechanics and Mathematics, Moscow State University, Moscow 119991, Russia

4. Department of Higher Mathematics, Moscow Institute of Radio Engineering, Electronics and Automation, Technical University, 78 Vernadskogo pr., Moscow 117454, Russia

Abstract

Normalization of feature vectors of datasets is widely used in a number of fields of data mining, in particular in cluster analysis, where it is used to prevent features with large numerical values from dominating in distance-based objective functions. In this study, a unified statistical approach to normalization of all attributes of mixed databases, when different metrics are used for numerical and categorical data, is proposed. After the proposed normalization, the contributions of both numerical and categorical attributes to a specified objective function are statistically the same. Formulae for the statistically normalized Minkowski mixed p-metrics are given in an explicit way. It is shown that the classic z-score standardization and the min–max normalization are particular cases of the statistical normalization, when the objective function is, respectively, based on the Euclidean or the Tchebycheff (Chebyshev) metrics. Finally, clustering of several benchmark datasets is performed with non-normalized and introduced normalized mixed metrics using either the k-prototypes (for p = 2) or another algorithm (for p ≠ 2).
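
As a rough illustration of the two classic normalizations the abstract names, the sketch below applies the textbook z-score and min–max formulas to numerical features and compares a Minkowski distance before and after scaling. It does not reproduce the paper's statistically normalized mixed metrics and ignores categorical attributes; the function names and the sample data are hypothetical, chosen only for this example.

```python
import math
from statistics import mean, pstdev


def z_score(column):
    """Textbook z-score: subtract the mean, divide by the standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation; assumed non-zero
    return [(x - mu) / sigma for x in column]


def min_max(column):
    """Textbook min-max scaling onto [0, 1]; assumes the column is not constant."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


def minkowski(a, b, p):
    """Minkowski p-distance; p = math.inf gives the Chebyshev distance."""
    if math.isinf(p):
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


# Hypothetical two-feature dataset: one feature with large values, one with small.
income = [20000.0, 50000.0, 80000.0]
ratio = [0.1, 0.9, 0.2]

raw = list(zip(income, ratio))
scaled = list(zip(z_score(income), z_score(ratio)))

# Without scaling, the Euclidean distance is dominated by the income feature.
d_raw = minkowski(raw[0], raw[1], 2)
# After z-scoring, both features contribute on a comparable scale.
d_scaled = minkowski(scaled[0], scaled[1], 2)

print(d_raw, d_scaled)
```

In this toy case the unscaled distance between the first two records is essentially the income difference alone, whereas after z-scoring the second feature contributes visibly to the result.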

Publisher

The Royal Society

Subject

General Physics and Astronomy,General Engineering,General Mathematics

Link

https://royalsocietypublishing.org/doi/pdf/10.1098/rspa.2011.0704

References: 36 articles.

1. Feature normalization and likelihood-based similarity measures for image retrieval

2. Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering

3. Asuncion A. & Newman D. J. 2007 UCI Machine Learning Repository. University of California, CA: School of Information and Computer Science. See http://www.ics.uci.edu/mlearn/MLRepository.html.

4. A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms

Cited by 79 articles.

1. Global low carbon transitions in the power sector: A machine learning clustering approach using archetypes;Journal of Economy and Technology;2024-11

2. Surrogate modeling for the long-term behavior of PC bridges via FEM analyses and long short-term neural networks;Structures;2024-05

3. Generation of meter-scale nanosecond pulsed DBD and the intelligent evaluation based on multi-dimensional feature parameter extraction;Journal of Physics D: Applied Physics;2024-04-11

4. Structural characterization of DNA amplicons by ATR-FTIR spectroscopy as a guide for screening metainflammatory disorders in blood plasma;Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy;2024-04

5. Governing Sea Level Rise in a Polycentric System;2024-03-30