Abstract
The increasing use of foundation models in biomedical applications creates both opportunities and challenges for analyzing the information captured in the high-dimensional embedding spaces of different models. Existing tools offer limited capabilities for comparing the information represented in the embedding spaces of different models. We introduce ema-tool, a Python library designed to analyze and compare embeddings from different models for a set of samples, focusing on the representation of groups known to share similarities. ema-tool examines pairwise distances to uncover local and global patterns and tracks the representations and relationships of these groups across different embedding spaces. We demonstrate the use of ema-tool through two examples. In the first example, we analyze the representation of ion channel proteins across versions of the ESM protein language models. In the second example, we analyze the representation of genetic variants within the HCN1 gene across the same models. The source code is available at https://github.com/broadinstitute/ema.
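To make the core idea concrete, below is a minimal, self-contained sketch of comparing pairwise distance structure for the same samples across two embedding spaces. It uses only NumPy and SciPy and does not call the actual ema-tool API; the variable names, embedding dimensions, and the Spearman-correlation comparison are illustrative assumptions, not the library's documented method.

```python
# Sketch: compare how two embedding spaces arrange the same samples.
# All data and names here are hypothetical; this is NOT the ema-tool API.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Stand-in embeddings for the same 50 samples from two models,
# e.g. two versions of a protein language model (dimensions may differ).
emb_model_a = rng.normal(size=(50, 320))
emb_model_b = rng.normal(size=(50, 1280))

# Condensed pairwise Euclidean distances within each embedding space.
dist_a = pdist(emb_model_a, metric="euclidean")
dist_b = pdist(emb_model_b, metric="euclidean")

# Rank correlation of the two distance vectors indicates how well the
# relative arrangement of samples is preserved across the two spaces.
rho, pval = spearmanr(dist_a, dist_b)
print(f"cross-model distance correlation: rho={rho:.3f} (p={pval:.2g})")

# Group-level view: mean within-group distance for a set of samples
# known to share similarities (here, arbitrarily, the first 10 samples).
group = np.arange(10)
within_a = pdist(emb_model_a[group]).mean()
within_b = pdist(emb_model_b[group]).mean()
print(f"mean within-group distance: A={within_a:.2f}, B={within_b:.2f}")
```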
Publisher
Cold Spring Harbor Laboratory