Abstract
With the emergence of single-cell foundation models, an important question arises: how do these models perform when trained on datasets with an imbalanced cell-type distribution caused by rare cell types or biased sampling? We benchmark three foundation models, scGPT, scBERT, and Geneformer, on cell-type annotation using skewed cell-type distributions. While all models had reduced performance when challenged with rare cell types, scGPT and scBERT performed better than Geneformer. Notably, in contrast to scGPT and scBERT, Geneformer uses ordinal positions of the tokenized genes rather than actual raw gene expression values. To mitigate the effect of a skewed distribution, we find that random oversampling, but not random undersampling, improved the performance of all three foundation models. Finally, scGPT, which uses FlashAttention, has the fastest computational speed, whereas scBERT is more memory-efficient. We conclude that tokenization and data representation are essential areas of research, and that new strategies are needed to mitigate the effects of imbalanced learning in single-cell foundation models. Code and data for reproducibility are available at https://github.com/SabbaghCodes/ImbalancedLearningForSingleCellFoundationModels.
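To make the two resampling strategies named above concrete, the following is a minimal sketch of random oversampling and random undersampling over cell-type labels. It is an illustration under stated assumptions, not the authors' implementation: the function names `random_oversample` and `random_undersample`, the use of NumPy, and the toy labels are all hypothetical and do not come from the paper or its repository.

```python
import numpy as np

def random_oversample(labels, seed=0):
    """Return cell indices in which every class is grown to the size of the
    largest class by resampling its own cells with replacement.
    (Hypothetical helper for illustration only.)"""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Keep every original cell, then draw the shortfall with replacement.
        extra = rng.choice(idx, size=target - count, replace=True)
        picked.append(np.concatenate([idx, extra]))
    return np.concatenate(picked)

def random_undersample(labels, seed=0):
    """Return cell indices in which every class is shrunk to the size of the
    smallest class by sampling without replacement.
    (Hypothetical helper for illustration only.)"""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.min()
    picked = [
        rng.choice(np.flatnonzero(labels == cls), size=target, replace=False)
        for cls in classes
    ]
    return np.concatenate(picked)

# Toy, made-up label vector with one rare cell type.
toy_labels = np.array(["T cell"] * 6 + ["B cell"] * 3 + ["rare"] * 1)
over_idx = random_oversample(toy_labels)
under_idx = random_undersample(toy_labels)
```

In this sketch, oversampling keeps all original cells and duplicates rare ones, while undersampling discards cells from the abundant types, so it trains on far fewer examples. The returned indices would be used to subset the training split only; the test split is left untouched.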
Publisher
Cold Spring Harbor Laboratory