Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis

Published: 2023-01-20 Issue: 3 Volume: 23 Page: 1219
ISSN: 1424-8220
Container-title: Sensors
Language: en
Short-container-title: Sensors

Author:

Azevedo Diogo¹, Rodrigues Ana Maria²˒³, Canhão Helena²˒³, Carvalho Alexandra M.⁴˒⁵˒⁶, Souto André¹˒⁴

Affiliation:

1. LASIGE, Departamento de Informática da Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal

2. EpiDoC Unit, The Chronic Diseases Research Centre, NOVA Medical School, NOVA University of Lisbon, 1169-056 Lisboa, Portugal

3. Comprehensive Health Research Center, NOVA Medical School, NOVA University of Lisbon, 1150-082 Lisboa, Portugal

4. Instituto de Telecomunicações, 1049-001 Lisboa, Portugal

5. Department of Electrical and Computer Engineering, Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisboa, Portugal

6. Lisbon Unit for Learning and Intelligent Systems, 1049-001 Lisboa, Portugal

Abstract

The normalized compression distance (NCD) is a compression-based similarity measure between a pair of finite objects. Clustering methods usually rely on distances (e.g., Euclidean or Manhattan distance) to measure the similarity between objects. The NCD is another such distance, with particular characteristics, that can be used to build the initial distance matrix for methods such as hierarchical clustering or K-medoids. In this work, we propose Zgli, a novel Python module that enables the user to compute the NCD between files inside a given folder. Inspired by the CompLearn Linux command-line tool, the module extends it with new text file compressors, a new compression-by-column option for tabular data such as CSV files, and an encoder for small files made up of categorical data. Our results demonstrate that compression by column can yield better results than previous methods in the literature when clustering tabular data. Additionally, the categorical encoder can augment categorical data, allowing the NCD to be used with new data types. One advantage of this feature is that it does not require knowledge or context of the data. Furthermore, because the module is written in Python, one of the most popular programming languages for machine learning, developers can readily use it to tackle problems with a compression-based approach. The pipeline was tested on clinical data and proved to be a promising computational strategy, providing patient stratification via clusters and thereby aiding precision medicine.
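
As an illustration of the distance the abstract describes, the following is a minimal sketch of the NCD under its standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes. It uses zlib from the Python standard library as a stand-in compressor. The function names `compressed_size`, `ncd`, and `ncd_matrix` are hypothetical and are not taken from the Zgli module, whose actual interface is not shown in this record.

```python
import zlib


def compressed_size(data: bytes, compressor=zlib.compress) -> int:
    """Length in bytes of the compressed representation of `data`."""
    return len(compressor(data))


def ncd(x: bytes, y: bytes, compressor=zlib.compress) -> float:
    """Normalized compression distance between two byte strings.

    Illustrative only: zlib is an assumed compressor, not necessarily
    one of those shipped with Zgli.
    """
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(objects, compressor=zlib.compress):
    """Symmetric pairwise NCD matrix as a list of lists.

    The diagonal is set to 0.0 by convention; with a real compressor
    ncd(x, x) is small but usually not exactly zero.
    """
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(objects[i], objects[j], compressor)
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

A matrix built this way could serve as the precomputed distance input to a hierarchical clustering or K-medoids routine. Any callable that maps bytes to bytes, such as `bz2.compress` or `lzma.compress`, can be passed as `compressor`.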

Funder

Fundação para a Ciência e Tecnologia

Instituto de Telecomunicações Research Unit

Fundo Europeu de Desenvolvimento Regional

Programa Operacional Regional LISBOA

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/23/3/1219/pdf
