Most teams I talk to are running their evals on instinct. They have a chat window, a hunch, and twelve tabs of comparative outputs. This works — until it doesn't. Around the fiftieth prompt, the room gets very quiet, because nobody can remember whether the last change made things better or just different.
§ 01 Why vibes fail past 50 prompts
There is no shame in vibes. Every eval starts as a vibe. The problem is that human attention does not scale linearly with output volume — it decays. By prompt forty, you are not evaluating; you are skimming.
A modest harness fixes this not by being clever, but by being patient. It looks at every output the same way, every time. That is its whole job.
```python
# the smallest useful eval loop
def run(suite, model):
    for case in suite:
        out = model(case.prompt)
        score = case.grade(out)      # str | bool | float
        record(case.id, out, score)  # append-only, never mutate
    return summarize()
```
§ 02 Three graders, no more
You only need three kinds of grader, and you should resist adding a fourth. Exact-match for things you can pin down. Rubric-LLM for things you cannot. Human for the cases the first two disagree on. Every additional grader type is a maintenance bill.
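The three grader kinds can be sketched as plain functions. This is a minimal illustration under assumed names and signatures (`exact_match`, `rubric_llm`, `needs_human`, and the `judge` callable are all hypothetical, not from any particular library):

```python
# Sketch of the three grader kinds. All names here are illustrative.

def exact_match(expected):
    """Grader for answers you can pin down verbatim."""
    def grade(output: str) -> bool:
        return output.strip() == expected.strip()
    return grade

def rubric_llm(rubric, judge):
    """Grader that asks a judge model to score output against a rubric.
    `judge` is any callable that takes a prompt and returns a numeric string."""
    def grade(output: str) -> float:
        verdict = judge(f"Rubric: {rubric}\nOutput: {output}\nScore 0 to 1:")
        return float(verdict.strip())
    return grade

def needs_human(exact, rubric, threshold=0.5):
    """Escalate to a person only when the first two graders disagree."""
    def grade(output: str):
        a = exact(output)
        b = rubric(output) >= threshold
        if a != b:
            return "HUMAN_REVIEW"
        return a
    return grade
```

The escalation grader is the important one: it keeps humans out of the loop for everything the cheap graders agree on, which is most cases.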
§ 03 The eval set is a living document
Treat the eval set like product copy, not like a unit test. Cases will go stale. The behavior you want today is not the behavior you wanted six months ago. Delete liberally; rewrite without ceremony.
Eval datasets rot faster than code. The most dangerous file in your repo is the one you wrote during onboarding and never opened again.
§ 04 The shape of a passing run
A run should fit in your head. If you cannot scan the entire summary in one breath, the harness has grown a second product around it, and you are now maintaining that.
| Run | Cases | Pass | Δ vs prev | p95 latency |
|---|---|---|---|---|
| #142 | 240 | 218 (90.8%) | +1.7% | 4.2s |
| #141 | 240 | 214 (89.2%) | −0.2% | 4.1s |
| #140 | 236 | 211 (89.4%) | +2.1% | 3.9s |
| #139 | 236 | 206 (87.3%) | — | 4.0s |
That is the entire harness. Anything else you build on top of this should justify its existence against a single question: does it help me decide whether the last change was good?
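A summary like the table above can come out of a `summarize()` as small as the loop that feeds it. This is a sketch under one assumption: that `record()` appended dicts with a boolean `score` and a `latency` in seconds to the list passed in here (the field names are illustrative).

```python
import math

def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def summarize(results, prev_pass_rate=None):
    """Reduce a run to the numbers in the summary table:
    case count, pass count/rate, delta vs the previous run, p95 latency."""
    passed = sum(1 for r in results if r["score"])
    rate = passed / len(results)
    delta = None if prev_pass_rate is None else rate - prev_pass_rate
    return {
        "cases": len(results),
        "pass": passed,
        "pass_rate": round(rate, 3),
        "delta_vs_prev": delta,
        "p95_latency": p95([r["latency"] for r in results]),
    }
```

One dict per run, appended to a log, is enough to regenerate the whole table.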