Abstract
Machine learning methods such as multivariate analysis and clustering are frequently used to analyze metabolomics data. An open question in such analyses is how much the results of supervised and unsupervised learning models differ. Metabolomics data typically include hundreds to thousands of metabolites, far more than the number of samples, and only a small fraction of these metabolites is relevant to the phenotype of interest. For this reason, sparse mechanisms have been introduced into many machine learning models. However, the explanatory power of such models decreases when the number of explanatory variables is reduced to an extreme level. In this paper, we chose serum lipidomic data from breast cancer patients, grouped by (1) pre- versus post-menopausal status and (2) before versus after neoadjuvant chemotherapy, as an example of metabolomics data. These data were analyzed with partial least squares (PLS) for regression and with K-means and hierarchical clustering for clustering. The results were also compared with those obtained by sparse modeling. There was no significant difference in accuracy between non-sparse and sparse modeling. The metabolite subsets selected by sparse modeling were almost identical to the PLS-selected features. In addition, several metabolites were consistently selected regardless of the algorithm used. These results contribute to the exploration of biomarkers in high-dimensional metabolomics datasets.
Publisher
Cold Spring Harbor Laboratory