// BENCHMARK_HARNESS | PROMPT_PACKS

Batch evaluation against baselines.

Each pack runs the colony and the baseline on every prompt. Vote at the bottom of each run, then unlock the AI judge for a blind verdict.
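The flow described above (run both systems on each prompt, vote without knowing which is which, then reveal) can be sketched roughly as follows. This is an illustration only, not the harness's actual code: `BlindPair`, `run_pack`, and `resolve_vote` are hypothetical names, the colony and baseline are assumed to be plain functions from prompt to text, and blinding is assumed to be a random A/B ordering.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BlindPair:
    """Two outputs for one prompt, shown as A and B (hypothetical structure)."""
    prompt: str
    option_a: str
    option_b: str
    a_is_colony: bool  # kept hidden from the voter until the vote is cast


def run_pack(
    prompts: List[str],
    colony: Callable[[str], str],    # assumed: colony system as a function
    baseline: Callable[[str], str],  # assumed: baseline system as a function
    rng: random.Random,
) -> List[BlindPair]:
    """Run colony and baseline on every prompt and randomize the A/B order."""
    pairs = []
    for prompt in prompts:
        colony_out = colony(prompt)
        baseline_out = baseline(prompt)
        a_is_colony = rng.random() < 0.5
        if a_is_colony:
            pairs.append(BlindPair(prompt, colony_out, baseline_out, True))
        else:
            pairs.append(BlindPair(prompt, baseline_out, colony_out, False))
    return pairs


def resolve_vote(pair: BlindPair, vote: str) -> str:
    """Map a blind vote ('A' or 'B') back to the system that produced it."""
    if vote not in ("A", "B"):
        raise ValueError("vote must be 'A' or 'B'")
    picked_a = vote == "A"
    return "colony" if picked_a == pair.a_is_colony else "baseline"
```

Under these assumptions, a human vote and a judge verdict could both go through `resolve_vote`, so neither sees the system labels before choosing.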

Pack

Tick Budget · 06/12

Baseline

Prompts · 0