BioDesignBench
Benchmark for evaluating LLM agents in protein design. 76 expert-curated tasks covering antibodies, enzymes, fluorescent proteins, binders, and scaffolds. Integrates AlphaFold, RFdiffusion, ProteinMPNN, and Rosetta. Measures tool-use behavior and design quality.
Composite
77.7
Experimental validation
Retrospective
Stages
Hit ID · Lead ID / ADMET
Modalities
protein_structure · ai_agent · protein_sequence
Task types
protein_design · agent_evaluation · tool_use
Size
expert-curated_tasks: 76
protein_categories: 5
License
Unknown
First release
2026-05
Last updated
2026-05
Official site
→ project page
Leaderboard
→ leaderboard
Dataset
→ dataset
Code / GitHub
→ repository
HuggingFace
→ HF
Paper
BioDesignBench: Benchmarking LLM Agents for Protein Design · 2026 · paper · doi:10.64898/2026.05.06.723381 · 0 citations
Flags
agent_benchmark · protein_engineering
Experts
—
Groups
—
Hosted by
—
Related benchmarks
—
Rubric (7-criterion)
rigor
5
coverage
4
maintenance
3
adoption
2
quality
5
accessibility
3
industry_relevance
5
Notes
Key finding: LLM agents select appropriate tools but evaluate designs superficially, rarely comparing alternatives. Strongest agents surpass hardcoded pipelines but underperform human experts. Enforcing deeper evaluation substantially improves performance.