Methodology

BioBenchmarks is an independent, open catalog of AI/ML benchmarks across drug discovery, scored with a standardized rubric.

Scope

This portal catalogs 90 individual benchmarks and 31 initiatives (which together track 2,413 benchmarks, of which 1,749 are surfaced as direct per-benchmark links). It also ranks 87 experts and 79 groups that build benchmarks, and documents 28 known-private industry benchmarks for context.

Pipeline taxonomy (12 stages)

  1. Virtual Cell — cell-state foundation models, perturbation prediction
  2. Disease modeling — disease signatures, mechanism maps
  3. Target ID — target-disease association, essentiality, druggability
  4. Hit ID — virtual screening, docking, bioactivity
  5. Lead ID / ADMET — property prediction
  6. Developmental candidate — multi-parameter optimization
  7. IND-enabling — safety, tox, PK projection
  8. Phase I — human PK/PD, dose prediction
  9. Phase II — efficacy prediction, biomarker qualification
  10. Phase III — outcome prediction, endpoint modeling
  11. Clinical development — cross-phase trial design, patient stratification
  12. Post-market / RWE — adverse events, signal detection

Cross-cutting modalities (tagged alongside stage): small molecule · biologic · peptide · PROTAC · cell therapy · gene therapy · RNA therapeutic · vaccine.
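
A minimal sketch of this tagging model, for illustration only (the field and value names below are our assumptions, not the portal's schema): each benchmark carries exactly one pipeline stage plus zero or more modality tags.

```python
# Illustrative tagging model: one pipeline stage per benchmark, any number of
# cross-cutting modality tags. Names are assumptions, not the portal's schema.
from dataclasses import dataclass, field

PIPELINE_STAGES = (
    "virtual_cell", "disease_modeling", "target_id", "hit_id",
    "lead_id_admet", "developmental_candidate", "ind_enabling",
    "phase_1", "phase_2", "phase_3", "clinical_development", "post_market_rwe",
)
MODALITIES = frozenset({
    "small_molecule", "biologic", "peptide", "protac", "cell_therapy",
    "gene_therapy", "rna_therapeutic", "vaccine",
})

@dataclass
class BenchmarkTags:
    stage: str                                          # exactly one pipeline stage
    modalities: set[str] = field(default_factory=set)   # zero or more modality tags

    def __post_init__(self):
        assert self.stage in PIPELINE_STAGES
        assert self.modalities <= MODALITIES

tags = BenchmarkTags(stage="hit_id", modalities={"small_molecule"})
```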

Benchmark rubric (7 criteria, 1–5 each)

| Criterion | Question | 1 = Weak | 5 = Gold standard |
|---|---|---|---|
| Scientific rigor | Peer-reviewed? Reproducible? Negative controls? | No paper, no docs | High-impact paper, reproduction studies, community-validated |
| Coverage | Task breadth + data volume | Single narrow task, small N | Broad task suite, >10k data points, multi-modality |
| Active maintenance | Last update / PR / leaderboard submission | >2 years stale | Updated within 3 months, responsive maintainers |
| Community adoption | Citations, GitHub stars, leaderboard entries | <50 cites, <50 stars | >1000 cites or >1000 stars or >50 leaderboard entries |
| Data quality | Curation, QC, known-issue tracking | Uncurated, no QC | Expert-curated, version-controlled, known-issue list |
| Accessibility | License + install experience | Restrictive license, manual data pull | OSS license + one-line install + stable API |
| Industry relevance | Pharma-validated translational signal | No industry usage evidence | Used by ≥3 top-20 pharma publicly; drives internal decisions |

Composite score: weighted mean of the seven criteria (rigor ×1.5, coverage ×1.2, adoption ×1.2, all others ×1.0), normalized to 0–100. The raw rubric scores are shown on every benchmark detail page.
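
For concreteness, a minimal sketch of how such a composite could be computed. This is illustrative only; the function names and the rescaling of the [1, 5] weighted mean onto [0, 100] are our assumptions, not the portal's code.

```python
# Illustrative composite: weighted mean of the seven 1-5 rubric scores
# (rigor x1.5, coverage x1.2, adoption x1.2, others x1.0), mapped to 0-100.
WEIGHTS = {
    "scientific_rigor": 1.5,
    "coverage": 1.2,
    "community_adoption": 1.2,
    "active_maintenance": 1.0,
    "data_quality": 1.0,
    "accessibility": 1.0,
    "industry_relevance": 1.0,
}

def composite_score(rubric: dict[str, int]) -> float:
    """Weighted mean of the seven 1-5 rubric scores, rescaled to 0-100."""
    weighted_mean = sum(WEIGHTS[c] * rubric[c] for c in WEIGHTS) / sum(WEIGHTS.values())
    # weighted_mean lies in [1, 5]; rescale so 1 -> 0 and 5 -> 100 (our assumption)
    return (weighted_mean - 1.0) / 4.0 * 100.0

# e.g. a rigorous, accessible benchmark with middling maintenance -> 76.6
print(round(composite_score({
    "scientific_rigor": 5, "coverage": 4, "community_adoption": 4,
    "active_maintenance": 3, "data_quality": 4,
    "accessibility": 5, "industry_relevance": 3,
}), 1))
```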

Anti-gaming: if a benchmark is maintained by the same group whose model dominates its leaderboard, it is flagged self_referential and its adoption score is down-weighted. Benchmarks that evaluate external frontier models against one another are not flagged.
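
A hedged sketch of that rule follows. Only the rule itself (maintainer group matches the group whose model tops the leaderboard, so flag and down-weight adoption) comes from this methodology; the 0.5 factor and the record fields are illustrative assumptions.

```python
# Illustrative anti-gaming check; field names and the 0.5 factor are assumptions.
def apply_anti_gaming(benchmark: dict) -> dict:
    """Flag and down-weight a benchmark whose maintainers also top its leaderboard."""
    maintainer = benchmark["maintainer_group"]
    top_group = benchmark["leaderboard"][0]["group"]   # current #1 entry
    # Benchmarks that only pit external frontier models against each other
    # never match here, so they are not flagged.
    if maintainer == top_group:
        benchmark.setdefault("flags", []).append("self_referential")
        benchmark["rubric"]["community_adoption"] *= 0.5   # illustrative factor
    return benchmark
```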

Experimental-validation tier (new in v2.0)

| Tier | Meaning |
|---|---|
| Clinical | Benchmark uses real clinical-trial outcomes or patient data. |
| Wet-lab confirmed | Top predictions are synthesized / expressed / assayed in a wet lab. |
| Prospective | Designed as a held-out, forward-looking test; submissions evaluated on data unseen at design time. |
| Retrospective | Historical data only, split into train / test after the fact. |
| None | Pure simulation or benchmarking harness with no experimental grounding. |
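
If you want to compare tiers programmatically, one possible encoding is an ordered enum. Treating the table above as a strict strongest-to-weakest ranking follows its listing order and is our reading, not something the portal guarantees.

```python
# Possible encoding of the validation tiers, strongest experimental grounding first.
from enum import IntEnum

class ValidationTier(IntEnum):
    NONE = 0
    RETROSPECTIVE = 1
    PROSPECTIVE = 2
    WET_LAB_CONFIRMED = 3
    CLINICAL = 4

assert ValidationTier.CLINICAL > ValidationTier.WET_LAB_CONFIRMED
```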

Expert rubric (6 criteria)

Group rubric (7 criteria)

Update cadence

Data files

See /downloads. Self-documenting JSON Schema: schema.json.
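
A minimal sketch of consuming the downloads is shown below; the benchmarks.json filename is a placeholder (the actual file names are listed under /downloads), and we assume the third-party jsonschema package for validation.

```python
# Validate downloaded data against schema.json. Filename is a placeholder.
import json
from jsonschema import validate  # pip install jsonschema

with open("schema.json") as f:
    schema = json.load(f)

with open("benchmarks.json") as f:   # placeholder; see /downloads for real names
    data = json.load(f)

validate(instance=data, schema=schema)  # raises jsonschema.ValidationError on mismatch
print("data matches schema.json")
```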

How honest is the ranking?

As honest as we can make it. We differentiate scores aggressively; not everything is a 5. We explicitly flag self-referential benchmarks, benchmarks with known data leakage, deprecated benchmarks (with recommended replacements), and license-gated commercial benchmarks. When a field cannot be verified, we write "unknown—<source checked>" rather than guess.
