Training Data Augmentation with Data Distilled by Principal Component Analysis

Author:

Sirakov Nikolay Metodiev (1), Shahnewaz Tahsin (1), Nakhmani Arie (2)

Affiliation:

1. Department of Mathematics, Texas A&M University-Commerce, Commerce, TX 75429, USA

2. Department of Electrical and Computer Engineering, University of Alabama at Birmingham, Birmingham, AL 35294, USA

Abstract

This work develops a new method for vector data augmentation. The proposed method applies principal component analysis (PCA) to determine the eigenvectors of a set of training vectors for a machine learning (ML) method and uses these eigenvectors to generate distilled vectors. The training and PCA-distilled vectors have the same dimension. The user chooses the number of vectors to be distilled and added to the set of training vectors. A statistical approach determines the lowest number of distilled vectors such that, when they are added to the original vectors, the extended set trains an ML classifier to a required accuracy. Hence, the novelty of this study is the distillation of vectors with PCA and their use to augment the original set of vectors. The advantage of this approach is that it improves the classification statistics of ML classifiers. To validate this advantage, we conducted experiments with four public databases and applied four classifiers: a neural network, logistic regression, and support vector machines with linear and polynomial kernels. For the purpose of augmentation, we conducted several distillations, including nested (double) distillation, in which new vectors are distilled from already distilled vectors. We trained the classifiers with three sets of vectors: the original vectors; the original vectors augmented with PCA-distilled vectors; and the original vectors augmented with both PCA-distilled and double-distilled vectors. The experimental results, presented in the paper, confirm that augmenting the original training vectors with PCA-distilled vectors improves the classification statistics of ML methods.
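
The abstract does not spell out how a distilled vector is computed from the eigenvectors, so the following is only a minimal sketch under one plausible assumption: a distilled vector is a training vector projected onto the leading principal components and reconstructed in the original space, which keeps its dimension unchanged. The function name `pca_distill`, its parameters, and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def pca_distill(X, n_components, n_distilled, seed=0):
    """Hypothetical helper: reconstruct n_distilled rows of X from the
    top n_components principal components (assumed reading of 'distillation')."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the training vectors
    cov = np.cov(Xc, rowvar=False)                 # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]                          # (d, k) leading eigenvectors
    idx = rng.choice(X.shape[0], size=n_distilled, replace=False)
    Z = Xc[idx] @ W                                # coordinates in the PCA subspace
    return Z @ W.T + mean, idx                     # back to the original dimension

# Synthetic stand-in for a training set: 100 vectors of dimension 5, two classes.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# First distillation, then augmentation of the original training set.
distilled, idx = pca_distill(X, n_components=2, n_distilled=30)
X_aug = np.vstack([X, distilled])
y_aug = np.concatenate([y, y[idx]])                # distilled vectors inherit labels

# Double (nested) distillation: distill again from the distilled vectors.
double, idx2 = pca_distill(distilled, n_components=2, n_distilled=10, seed=1)
X_aug2 = np.vstack([X_aug, double])
y_aug2 = np.concatenate([y_aug, y[idx][idx2]])
```

Here `X_aug` and `X_aug2` correspond to the second and third training sets described in the abstract. In practice the distillation would probably be applied per class, and the number of distilled vectors would be chosen by the statistical procedure the abstract mentions, which this sketch does not attempt to reproduce.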

Funder

National Institutes of Health

Publisher

MDPI AG

