BioCoder: a benchmark for bioinformatics code generation with large language models

Published: 2024-06-28 Issue: Supplement_1 Volume: 40 Pages: i266-i276
ISSN: 1367-4803
Container-title: Bioinformatics
Language: en

Author:

Tang Xiangru¹, Qian Bill¹, Gao Rick¹, Chen Jiakang¹, Chen Xinyun², Gerstein Mark B¹³⁴⁵⁶

Affiliation:

1. Department of Computer Science, Yale University, New Haven, CT 06520, United States

2. Google DeepMind, Mountain View, CA 94043, United States

3. Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States

4. Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States

5. Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States

6. Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States

Abstract

Summary: Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%).

Availability and implementation: All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
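The abstract reports results in terms of Pass@K without defining it. As a rough illustration only, the sketch below uses the unbiased estimator that is common in the code-generation literature: draw n samples per problem, count the c samples that pass every test, and estimate the chance that at least one of k randomly chosen samples passes. Whether BioCoder computes the metric exactly this way is an assumption, and the names pass_at_k and mean_pass_at_k are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all tests
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n-c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n, c) for one problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a problem with 20 samples of which 5 pass gives pass_at_k(20, 5, 1) equal to 5/20, and the benchmark-level score is the mean of the per-problem estimates.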

Funder

Schmidt Futures

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bioinformatics/article-pdf/40/Supplement_1/i266/59088970/btae230.pdf

References: 63 articles.

Cited by 1 article.

1. Can Large Language Models Write Parallel Code? Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing; 2024-06-03