Recursive computed ABC (cABC) analysis as a precise method for reducing machine learning based feature sets to their minimum informative size

Author:

Lötsch Jörn1,Ultsch Alfred2

Affiliation:

1. Goethe - University

2. University of Marburg

Abstract

Abstract Background Selecting the k best features is a common task in machine-learning. Typically, a few variables have high importance, but many have low importance (right skewed distribution). This report proposes a numerically precise method to address this skewed feature importance distribution to reduce a feature set to the informative minimum of items. Methods Computed ABC analysis (cABC) is an item categorization method that aims to identify the most important elements by dividing a set of non-negative numerical elements into subsets "A", "B" and "C" such that subset "A" contains the "few important " items based on specific properties of ABC curves defined by their relationship to Lorenz curves. In its recursive form, the cABC analysis can be applied again to subset "A". A generic image data set and three biomedical datasets (lipidomics and two genomics datasets) with a large number of variables were used to perform the experiments. Results Experimental results show that recursive cABC analysis limits dimensions of data projection to a minimum where the relevant information is still preserved and directs feature selection in machine learning to the most important class-relevant information including filtering feature sets for nonsense variables. Feature sets were reduced to 10% or less of the original variables and still provided accurate classification in data unused for feature selection. Conclusions cABC analysis, in its recursive variant, provides a computational precise defined means of reducing information to a minimum. The minimum is the result of a computation of the number of k most relevant items rather than of a decision to select the k best items from a list. Furthermore, precise criteria for stopping the reduction process are available. The reduction to the most important features can increase human comprehension of the properties of the data set. The cABC method is implemented in the Python package "cABCanalysis" available at https://pypi.org/project/cABCanalysis/.

Publisher

Research Square Platform LLC

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3