BioDesignBench
Benchmark for evaluating LLM agents in protein design. 76 expert-curated tasks covering antibodies, enzymes, fluorescent proteins, binders, and scaffolds. Integrates AlphaFold, RFdiffusion, ProteinMPNN, and Rosetta. Measures tool-use behavior and design quality.
Composite
77.7
Experimental validation
Retrospective
Stages
Hit ID · Lead ID / ADMET
Modalities
protein_structure · ai_agent · protein_sequence
Task types
protein_design · agent_evaluation · tool_use
Size
expert-curated_tasks: 76
protein_categories: 5
License
Unknown
First release
2026-05
Last updated
2026-05
Official site
→ project page
Leaderboard
→ leaderboard
Dataset
→ dataset
Code / GitHub
→ repository
HuggingFace
→ HF
Paper
BioDesignBench: Benchmarking LLM Agents for Protein Design · 2026 · paper · doi:10.64898/2026.05.06.723381 · 0 citations
Flags
agent_benchmark · protein_engineering
Experts
—
Groups
—
Hosted by
—
Related benchmarks
—
Rubric (7-criterion)
rigor
5
coverage
4
maintenance
3
adoption
2
quality
5
accessibility
3
industry_relevance
5
Notes
Key finding: LLM agents select appropriate tools but evaluate designs superficially, rarely comparing alternatives. Strongest agents surpass hardcoded pipelines but underperform human experts. Enforcing deeper evaluation substantially improves performance.