// BENCHMARK_HARNESS | PROMPT_PACKS
Batch evaluation against baselines.
Each pack runs both the colony and the baseline on every prompt. Vote at the bottom of each run, then unlock the AI judge for a blind verdict.
06/12
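The flow described above (run both systems per prompt, collect a vote, keep the comparison blind) could be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `run_pack`, `BlindPair`, and `tally` are hypothetical names, and the colony and baseline are assumed to be plain callables that map a prompt to an output string.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BlindPair:
    """One prompt with both outputs hidden behind anonymous labels."""
    prompt: str
    outputs: Dict[str, str]  # anonymous label ("A"/"B") -> output text
    key: Dict[str, str]      # anonymous label -> system name, kept hidden from voters


def run_pack(
    prompts: List[str],
    colony: Callable[[str], str],      # assumed interface: prompt -> output
    baseline: Callable[[str], str],    # assumed interface: prompt -> output
    rng: random.Random,
) -> List[BlindPair]:
    """Run colony and baseline on every prompt and shuffle the label assignment."""
    pairs = []
    for prompt in prompts:
        results = [("colony", colony(prompt)), ("baseline", baseline(prompt))]
        rng.shuffle(results)  # randomize which system appears as "A"
        labels = ["A", "B"]
        outputs = {label: text for label, (_, text) in zip(labels, results)}
        key = {label: name for label, (name, _) in zip(labels, results)}
        pairs.append(BlindPair(prompt, outputs, key))
    return pairs


def tally(pairs: List[BlindPair], votes: List[str]) -> Dict[str, int]:
    """Map anonymous votes ("A" or "B") back to system names and count them."""
    counts = {"colony": 0, "baseline": 0}
    for pair, vote in zip(pairs, votes):
        counts[pair.key[vote]] += 1
    return counts
```

A human voter or an AI judge would see only `pair.prompt` and `pair.outputs`; the `key` mapping is consulted after the votes are in, which is what keeps the verdict blind.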