Abstract
Background
Identifying optimal candidate genes from large-scale blood transcriptomic data is crucial for developing targeted assays that monitor immune responses. Here, we use a large language model (LLM)-based approach to prioritize candidate biomarkers from blood transcriptional modules.
Methods
Focusing on module M14.51 from the BloodGen3 repertoire, which is associated with erythroid cells and erythropoiesis, we used OpenAI's GPT-4 and Anthropic's Claude to score and rank the module's constituent genes across six criteria: relevance to erythroid biology, use as an existing biomarker, potential as a blood biomarker, relevance to leukocyte immune biology, status as a drug target, and relevance to immune disease therapeutics. The LLMs were then used to select a top candidate gene based on the scoring justifications. Reference transcriptome data were incorporated to validate the selection.
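The scoring-and-ranking step described above can be illustrated with a minimal sketch. The abstract does not specify the score scale or how scores are combined, so everything below is an assumption: the criterion keys, the function name `rank_genes`, the choice to average each criterion across models and sum the averages, and the placeholder gene names and scores (which are not results from the study).

```python
from statistics import mean

# Hypothetical keys for the six criteria named in the Methods.
CRITERIA = [
    "erythroid_biology",
    "existing_biomarker",
    "blood_biomarker_potential",
    "leukocyte_immune_biology",
    "drug_target",
    "immune_disease_therapeutic",
]


def rank_genes(scores_by_model):
    """Rank genes by total score.

    scores_by_model maps model name -> gene -> criterion -> numeric score.
    Assumption: each criterion is averaged across the models that scored
    the gene, and the six averages are summed into one total per gene.
    """
    genes = set()
    for gene_scores in scores_by_model.values():
        genes.update(gene_scores)

    totals = {}
    for gene in genes:
        total = 0
        for criterion in CRITERIA:
            values = [
                model_scores[gene][criterion]
                for model_scores in scores_by_model.values()
                if gene in model_scores
            ]
            total += mean(values)
        totals[gene] = total

    # Highest total first; ties broken alphabetically for determinism.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Placeholder input: invented genes and scores, for illustration only.
example_scores = {
    "model_a": {
        "GENE_A": {c: 8 for c in CRITERIA},
        "GENE_B": {c: 5 for c in CRITERIA},
    },
    "model_b": {
        "GENE_A": {c: 6 for c in CRITERIA},
        "GENE_B": {c: 7 for c in CRITERIA},
    },
}

ranking = rank_genes(example_scores)
top_candidate = ranking[0][0]
```

In this sketch the top-ranked gene is simply the one with the highest summed score; in the workflow described here, the final selection is instead made by the LLMs from the scoring justifications and then checked against reference transcriptome data.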
Results
The LLMs consistently identified Glutathione Peroxidase 4 (GPX4) as the top candidate gene for module M14.51. GPX4's role in oxidative stress regulation, its potential as a future drug target, and its expression across diverse immune cell types supported its selection. The incorporation of reference transcriptome data further validated GPX4 as the most suitable candidate for this module.
Conclusions
Our LLM-driven workflow makes candidate gene prioritization more efficient and supports the development of biologically relevant and clinically informative targeted assays. The identification of GPX4 as a key gene in the erythroid cell-associated module M14.51 highlights the potential of this approach for biomarker discovery and targeted assay development.