Sanitized clustering against confounding bias

Published:2023-12-27
ISSN:0885-6125
Container-title:Machine Learning
language:en
Short-container-title:Mach Learn

Author:

Yao Yinghua,Pan Yuangang,Li Jing,Tsang Ivor W.,Yao Xin

Abstract

Real-world datasets inevitably contain biases that arise from different sources or conditions during data collection. Consequently, such inconsistency itself acts as a confounding factor that disturbs the cluster analysis. Existing methods eliminate the biases by projecting data onto the orthogonal complement of the subspace spanned by the confounding factor before clustering. In these methods, the clustering factor of interest and the confounding factor are treated coarsely in the raw feature space, where the correlation between the data and the confounding factor is assumed to be linear for the sake of convenient solutions. These approaches are thus limited in scope, as data in real applications is usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB), which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. Specifically, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation produced by a variational auto-encoder. Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that SCAB achieves a significant gain in clustering performance by removing the confounding bias.
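
The linear baseline that the abstract contrasts with SCAB can be illustrated with a minimal sketch. It is not the paper's method, and it assumes the confounding factor is a single categorical label (here a hypothetical data source, "a" or "b"). In that case, projecting the data onto the orthogonal complement of the subspace spanned by the one-hot source indicators amounts to subtracting each source's feature-wise mean. The function name `remove_linear_confounder` and the toy data are made up for illustration.

```python
from collections import defaultdict


def remove_linear_confounder(X, sources):
    """Subtract each source's feature-wise mean from its rows.

    This equals projecting X onto the orthogonal complement of the span
    of the one-hot source indicators (assumed categorical confounder).
    """
    n_features = len(X[0])
    sums = defaultdict(lambda: [0.0] * n_features)
    counts = defaultdict(int)
    # Accumulate per-source feature sums and sample counts.
    for row, s in zip(X, sources):
        counts[s] += 1
        for j, v in enumerate(row):
            sums[s][j] += v
    means = {s: [t / counts[s] for t in sums[s]] for s in sums}
    # Remove the per-source mean from every sample.
    return [[v - means[s][j] for j, v in enumerate(row)]
            for row, s in zip(X, sources)]


# Toy data: feature 0 carries the cluster structure (values 0 or 10);
# feature 1 carries only a source-dependent offset (1 for "a", 4 for "b").
X = [[0.0, 1.0], [10.0, 1.0], [0.0, 4.0], [10.0, 4.0]]
sources = ["a", "a", "b", "b"]
X_clean = remove_linear_confounder(X, sources)
```

After the projection, the source offset in feature 1 is gone while the separation in feature 0 is preserved. This works only because the bias here is an additive shift in the raw feature space; the abstract's point is that such a linear assumption tends to fail on complex data, which motivates removing the bias in a learned latent space instead.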

Funder

A*STAR Centre for Frontier AI Research

Program for Guangdong Introducing Innovative and Entrepreneurial Teams

Program for Guangdong Provincial Key Laboratory

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence,Software

Link

https://link.springer.com/content/pdf/10.1007/s10994-023-06451-5.pdf

Cited by 1 article.

1. LSVC: A Lifelong Learning Approach for Stream-View Clustering; IEEE Transactions on Neural Networks and Learning Systems; 2024