Methodology
BioBenchmarks is an independent, open catalog of AI/ML benchmarks across drug discovery, scored with a standardized rubric.
Scope
This portal catalogs 90 individual benchmarks and 31 initiatives (which together track 2,413 benchmarks, of which 1,749 are surfaced as direct per-benchmark links). It also ranks 87 experts and 79 groups that build benchmarks, and documents 28 known-private industry benchmarks for context.
Pipeline taxonomy (12 stages)
- Virtual Cell — cell-state foundation models, perturbation prediction
- Disease modeling — disease signatures, mechanism maps
- Target ID — target-disease association, essentiality, druggability
- Hit ID — virtual screening, docking, bioactivity
- Lead ID / ADMET — property prediction
- Developmental candidate — multi-parameter optimization
- IND-enabling — safety, tox, PK projection
- Phase I — human PK/PD, dose prediction
- Phase II — efficacy prediction, biomarker qualification
- Phase III — outcome prediction, endpoint modeling
- Clinical development — cross-phase trial design, patient stratification
- Post-market / RWE — adverse events, signal detection
Cross-cutting modalities (tagged alongside stage): small molecule · biologic · peptide · PROTAC · cell therapy · gene therapy · RNA therapeutic · vaccine.
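As a rough illustration of how a stage tag and modality tags might combine on a single catalog record (the field names and benchmark name below are hypothetical, not the published schema):

```python
# Hypothetical catalog record: exactly one pipeline stage, plus any
# number of cross-cutting modality tags. Field names are illustrative;
# schema.json (see Data files below) defines the real structure.
example_entry = {
    "name": "Example-ADMET-Bench",                # made-up benchmark name
    "stage": "Lead ID / ADMET",                   # one of the 12 stages
    "modalities": ["small molecule", "peptide"],  # zero or more tags
}
```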
Benchmark rubric (7 criteria, 1–5 each)
| Criterion | Question | 1 = Weak | 5 = Gold standard |
|---|---|---|---|
| Scientific rigor | Peer-reviewed? Reproducible? Negative controls? | No paper, no docs | High-impact paper, reproduction studies, community-validated |
| Coverage | Task breadth + data volume | Single narrow task, small N | Broad task suite, >10k data points, multi-modality |
| Active maintenance | Last update / PR / leaderboard submission | >2 years stale | Updated within 3 months, responsive maintainers |
| Community adoption | Citations, GitHub stars, leaderboard entries | <50 cites, <50 stars | >1000 cites or >1000 stars or >50 leaderboard entries |
| Data quality | Curation, QC, known-issue tracking | Uncurated, no QC | Expert-curated, version-controlled, known-issue list |
| Accessibility | License + install experience | Restrictive license, manual data pull | OSS license + one-line install + stable API |
| Industry relevance | Pharma-validated translational signal | No industry usage evidence | Used by ≥3 top-20 pharma publicly; drives internal decisions |
Composite: a weighted mean (rigor ×1.5, coverage ×1.2, adoption ×1.2, all others ×1.0), normalized to 0–100. The raw rubric scores are shown on every benchmark detail page.
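Written out, with r, c, a, m, q, x, i standing for the rigor, coverage, adoption, maintenance, data-quality, accessibility, and industry-relevance scores, the weighted mean is as follows; the rescaling shown (mapping the 1–5 mean linearly onto 0–100) is an assumption, since the exact normalization is not spelled out above.

$$
\bar{s} = \frac{1.5\,r + 1.2\,c + 1.2\,a + m + q + x + i}{7.9},
\qquad
\text{composite} = 100 \cdot \frac{\bar{s} - 1}{4}
$$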
Anti-gaming: if a benchmark is maintained by the same group whose model dominates its leaderboard, it is flagged self_referential and its adoption score is down-weighted. Benchmarks that pit external frontier models against each other are not flagged.
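A minimal sketch of how the composite and the self_referential down-weight could be computed; the 0.5 adoption penalty, the field names, and the 1–5 to 0–100 rescaling are assumptions for illustration, not the portal's actual code.

```python
# Sketch: composite score with self_referential adoption down-weight.
# Weights follow the rubric above; the 0.5 penalty and the linear
# 1-5 -> 0-100 rescaling are assumptions for illustration only.
WEIGHTS = {
    "rigor": 1.5,
    "coverage": 1.2,
    "adoption": 1.2,
    "maintenance": 1.0,
    "data_quality": 1.0,
    "accessibility": 1.0,
    "industry_relevance": 1.0,
}

def composite_score(scores: dict, self_referential: bool = False) -> float:
    """Weighted mean of the seven 1-5 rubric scores, rescaled to 0-100."""
    scores = dict(scores)
    if self_referential:
        # Down-weight adoption for benchmarks maintained by the group
        # whose own model dominates the leaderboard (penalty assumed).
        scores["adoption"] *= 0.5
    weighted = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    mean = weighted / sum(WEIGHTS.values())
    return max(0.0, 100 * (mean - 1) / 4)  # map the 1-5 scale onto 0-100

# Example: a well-adopted benchmark flagged self_referential
print(composite_score({
    "rigor": 4, "coverage": 3, "adoption": 5, "maintenance": 4,
    "data_quality": 4, "accessibility": 5, "industry_relevance": 2,
}, self_referential=True))
```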
Experimental-validation tier (new in v2.0)
| Tier | Meaning |
|---|---|
| Clinical | Benchmark uses real clinical-trial outcomes or patient data. |
| Wet-lab confirmed | Top predictions are synthesized / expressed / assayed in a wet lab. |
| Prospective | Designed as a held-out, forward-looking test; submissions evaluated on data unseen at design time. |
| Retrospective | Historical data only, split into train / test after the fact. |
| None | Pure simulation or benchmarking harness with no experimental grounding. |
Expert rubric (6 criteria)
- Benchmarks authored — number of benchmarks where the person is primary author or lead maintainer.
- Benchmark citations — aggregate citations of their benchmark papers.
- Scope — whether work spans multiple pipeline stages or stays narrow.
- Community role — editor, workshop chair, challenge organizer, consortium lead.
- Recency — active in the last 2 years.
- Rigor flags — any retractions or questionable benchmark practices (negative indicator).
Group rubric (7 criteria)
- Output volume, median benchmark quality, breadth, openness, industry uptake, longevity, and translational signal.
Update cadence
- Monthly: citation counts, leaderboard positions, newly released benchmarks.
- Quarterly: full rubric re-score (maintenance and adoption shift fast).
- Annually: taxonomy review (new stages, new modalities).
Data files
All data files are available under /downloads. A self-documenting JSON Schema is provided as schema.json.
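As a sketch of how the downloads might be consumed, the published schema can be used to validate a pulled data file with the jsonschema package; "benchmarks.json" below is an assumed file name, since only schema.json is named above.

```python
# Sketch: validate a downloaded data file against the published schema.
# "benchmarks.json" is an assumed file name for illustration.
import json
from jsonschema import validate  # pip install jsonschema

with open("schema.json") as f:
    schema = json.load(f)
with open("benchmarks.json") as f:
    data = json.load(f)

validate(instance=data, schema=schema)  # raises ValidationError on mismatch
print("data matches schema.json")
```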
How honest is the ranking?
As honest as we can make it. We differentiate scores aggressively; not everything is a 5. We explicitly flag self-referential benchmarks, benchmarks with known data leakage, deprecated benchmarks (with recommended replacements), and license-gated commercial benchmarks. When a field cannot be verified we write "unknown—<source checked>" rather than guess.