Methodology
BioBenchmarks is an independent, open catalog of AI/ML benchmarks across drug discovery, scored with a standardized rubric.
Scope
This portal catalogs 90 individual benchmarks and 31 initiatives (which together track 2,413 benchmarks, of which 1,749 are surfaced as direct per-benchmark links). It also ranks 87 experts and 79 groups that build benchmarks, and documents 28 known-private industry benchmarks for context.
Pipeline taxonomy (12 stages)
- Virtual Cell — cell-state foundation models, perturbation prediction
- Disease modeling — disease signatures, mechanism maps
- Target ID — target-disease association, essentiality, druggability
- Hit ID — virtual screening, docking, bioactivity
- Lead ID / ADMET — property prediction
- Developmental candidate — multi-parameter optimization
- IND-enabling — safety, tox, PK projection
- Phase I — human PK/PD, dose prediction
- Phase II — efficacy prediction, biomarker qualification
- Phase III — outcome prediction, endpoint modeling
- Clinical development — cross-phase trial design, patient stratification
- Post-market / RWE — adverse events, signal detection
Cross-cutting modalities (tagged alongside stage): small molecule · biologic · peptide · PROTAC · cell therapy · gene therapy · RNA therapeutic · vaccine.
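As a rough illustration of how a stage tag and modality tags might combine on a single catalog record (the field names and benchmark name below are hypothetical, not the published schema):

```python
# Hypothetical catalog record: exactly one pipeline stage, plus any
# number of cross-cutting modality tags. Field names are illustrative;
# schema.json (see Data files below) defines the real structure.
example_entry = {
    "name": "Example-ADMET-Bench",                # made-up benchmark name
    "stage": "Lead ID / ADMET",                   # one of the 12 stages
    "modalities": ["small molecule", "peptide"],  # zero or more tags
}
```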
Benchmark rubric (7 criteria, 1–5 each)
| Criterion | Question | 1 = Weak | 5 = Gold standard |
|---|---|---|---|
| Scientific rigor | Peer-reviewed? Reproducible? Negative controls? | No paper, no docs | High-impact paper, reproduction studies, community-validated |
| Coverage | Task breadth + data volume | Single narrow task, small N | Broad task suite, >10k data points, multi-modality |
| Active maintenance | Last update / PR / leaderboard submission | >2 years stale | Updated within 3 months, responsive maintainers |
| Community adoption | Citations, GitHub stars, leaderboard entries | <50 cites, <50 stars | >1000 cites or >1000 stars or >50 leaderboard entries |
| Data quality | Curation, QC, known-issue tracking | Uncurated, no QC | Expert-curated, version-controlled, known-issue list |
| Accessibility | License + install experience | Restrictive license, manual data pull | OSS license + one-line install + stable API |
| Industry relevance | Pharma-validated translational signal | No industry usage evidence | Used by ≥3 top-20 pharma publicly; drives internal decisions |
Composite: a weighted mean (rigor ×1.5, coverage ×1.2, adoption ×1.2, all others ×1.0), normalized to 0–100. The raw rubric scores are shown on every benchmark detail page.
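Written out, with r, c, a, m, q, x, i standing for the rigor, coverage, adoption, maintenance, data-quality, accessibility, and industry-relevance scores, the weighted mean is as follows; the rescaling shown (mapping the 1–5 mean linearly onto 0–100) is an assumption, since the exact normalization is not spelled out above.

$$
\bar{s} = \frac{1.5\,r + 1.2\,c + 1.2\,a + m + q + x + i}{7.9},
\qquad
\text{composite} = 100 \cdot \frac{\bar{s} - 1}{4}
$$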
Anti-gaming: if a benchmark is maintained by the same group whose model dominates its leaderboard, it is flagged self_referential and its adoption score is down-weighted. Benchmarks that pit external frontier models against each other are not flagged.
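A minimal sketch of how the composite and the self_referential down-weight could be computed; the 0.5 adoption penalty, the field names, and the 1–5 to 0–100 rescaling are assumptions for illustration, not the portal's actual code.

```python
# Sketch: composite score with self_referential adoption down-weight.
# Weights follow the rubric above; the 0.5 penalty and the linear
# 1-5 -> 0-100 rescaling are assumptions for illustration only.
WEIGHTS = {
    "rigor": 1.5,
    "coverage": 1.2,
    "adoption": 1.2,
    "maintenance": 1.0,
    "data_quality": 1.0,
    "accessibility": 1.0,
    "industry_relevance": 1.0,
}

def composite_score(scores: dict, self_referential: bool = False) -> float:
    """Weighted mean of the seven 1-5 rubric scores, rescaled to 0-100."""
    scores = dict(scores)
    if self_referential:
        # Down-weight adoption for benchmarks maintained by the group
        # whose own model dominates the leaderboard (penalty assumed).
        scores["adoption"] *= 0.5
    weighted = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    mean = weighted / sum(WEIGHTS.values())
    return max(0.0, 100 * (mean - 1) / 4)  # map the 1-5 scale onto 0-100

# Example: a well-adopted benchmark flagged self_referential
print(composite_score({
    "rigor": 4, "coverage": 3, "adoption": 5, "maintenance": 4,
    "data_quality": 4, "accessibility": 5, "industry_relevance": 2,
}, self_referential=True))
```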
Experimental-validation tier (new in v2.0)
| Tier | Meaning |
|---|---|
| Clinical | Benchmark uses real clinical-trial outcomes or patient data. |
| Wet-lab confirmed | Top predictions are synthesized / expressed / assayed in a wet lab. |
| Prospective | Designed as a held-out, forward-looking test; submissions evaluated on data unseen at design time. |
| Retrospective | Historical data only, split into train / test after the fact. |
| None | Pure simulation or benchmarking harness with no experimental grounding. |
Expert rubric (6 criteria)
- Benchmarks authored — number of benchmarks where the person is primary author or lead maintainer.
- Benchmark citations — aggregate citations of their benchmark papers.
- Scope — whether work spans multiple pipeline stages or stays narrow.
- Community role — editor, workshop chair, challenge organizer, consortium lead.
- Recency — active in the last 2 years.
- Rigor flags — any retractions or questionable benchmark practices (negative indicator).
Group rubric (7 criteria)
- Output volume, median benchmark quality, breadth, openness, industry uptake, longevity, and translational signal.
Update cadence
- Monthly: citation counts, leaderboard positions, newly released benchmarks.
- Quarterly: full rubric re-score (maintenance and adoption shift fast).
- Annually: taxonomy review (new stages, new modalities).
Data files
All data files are available under /downloads. A self-documenting JSON Schema is provided as schema.json.
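As a sketch of how the downloads might be consumed, the published schema can be used to validate a pulled data file with the jsonschema package; "benchmarks.json" below is an assumed file name, since only schema.json is named above.

```python
# Sketch: validate a downloaded data file against the published schema.
# "benchmarks.json" is an assumed file name for illustration.
import json
from jsonschema import validate  # pip install jsonschema

with open("schema.json") as f:
    schema = json.load(f)
with open("benchmarks.json") as f:
    data = json.load(f)

validate(instance=data, schema=schema)  # raises ValidationError on mismatch
print("data matches schema.json")
```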
How honest is the ranking?
As honest as we can make it. We differentiate scores aggressively; not everything is a 5. We explicitly flag self-referential benchmarks, benchmarks with known data leakage, deprecated benchmarks (with recommended replacements), and license-gated commercial benchmarks. When a field cannot be verified we write "unknown—<source checked>" rather than guess.