GAN-based data augmentation for transcriptomics: survey and comparative assessment

Published:2023-06-01 Issue:Supplement_1 Volume:39 Page:i111-i120
ISSN:1367-4803
Container-title:Bioinformatics
language:en
Author:

Lacan Alice¹, Sebag Michèle², Hanczar Blaise¹

Affiliation:

1. IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France

2. TAU, CNRS-INRIA-LISN, University Paris-Saclay, Gif-sur-Yvette 91190, France

Abstract

Motivation: Transcriptomics data are becoming more accessible thanks to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting the full predictive power of deep learning models for phenotype prediction. Artificially enlarging the training set, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes.

Results: This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields an accuracy of 94% for binary classification and 70% for tissue classification. In comparison, we achieved 98% and 94% accuracy, respectively, when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN yield better augmentation performance and better generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly.

Availability and implementation: All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics

Funder

Labex DigiCosme

University Paris-Saclay

French National Research Agency

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

https://academic.oup.com/bioinformatics/article-pdf/39/Supplement_1/i111/50741877/btad239.pdf

References: 57 articles.

Cited by 3 articles.

1. In Silico Generation of Gene Expression profiles using Diffusion Models;2024-04-13

2. Generating bulk RNA-Seq gene expression data based on generative deep learning models and utilizing it for data augmentation;Computers in Biology and Medicine;2024-02

3. Multiorgan locked-state model of chronic diseases and systems pharmacology opportunities;Drug Discovery Today;2024-01