Constructing outlier-free histograms with variable bin-width based on distance minimization

Author:

Fushimi Takayasu1,Saito Kazumi23,Motoda Hiroshi4

Affiliation:

1. School of Computer Science, Tokyo University of Technology, Tokyo, Japan

2. Faculty of Science, Kanagawa University, Kanagawa, Japan

3. Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan

4. Institute of Scientific and Industrial Research, Osaka University, Osaka, Japan

Abstract

We propose a new method of constructing a variable bin width histogram that can accommodate the unbalanced distribution of the samples yet retaining, as a whole, the good aspect of both equal width (EW) and equal-area (EA) histograms that are being used popularly for data visualization and analysis. We formulate this as an optimal change point detection problem in which the bin boundaries are determined by minimizing the sum of the absolute error or the squared error in each bin. The former is based on Distance Minimization (DM) and new, and the latter is based on Variance Minimization (VM) and is considered the state-of-the-art. The constructed histograms can effectively be used to detect and visualize hidden outliers/anomalies by applying the interquartile range method in each bin. The final histograms are obtained by adjusting bin boundaries and heights accordingly after removing the detected outliers/anomalies. We further propose a method to annotate the constructed bins if the data for annotation is given for each sample as a set of nominal variables, using z-score with respect to their distribution within each bin. We applied our method to both real vinyl greenhouse datasets and two different sets of three synthetic datasets, and confirmed that both DM and VM methods work as intended, both can represent the sample distribution with a smaller number of bins than those by EW and EA methods, The use of interquartile range method can detect anomalies as well as outliers, and the terms selected for annotation are interpretable and reasonable. EW and EA methods have contrasting properties. DM and VM methods lie in between, but the former is closer to EA method and the latter to EW method. DM method runs substantially faster than VM method and performs slightly better than VM method in outlier detection and annotation tasks.

Publisher

IOS Press

Subject

Artificial Intelligence,Computer Vision and Pattern Recognition,Theoretical Computer Science

Cited by 1 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3