014 · April 28, 2026 · 12 min read

A field guide to writing your own eval harness

Why "vibes-based" testing collapses past 50 prompts, and the smallest harness that scales without becoming a second product.

Most teams I talk to are running their evals on instinct. They have a chat window, a hunch, and twelve tabs of comparative outputs. This works — until it doesn't. Around the fiftieth prompt, the room gets very quiet, because nobody can remember whether the last change made things better or just different.

§ 01 Why vibes fail past 50 prompts

There is no shame in vibes. Every eval starts as a vibe. The problem is that human attention does not scale linearly with output volume — it decays. By prompt forty, you are not evaluating; you are skimming.

A modest harness fixes this not by being clever, but by being patient. It looks at every output the same way, every time. That is its whole job.

harness.py python
# the smallest useful eval loop
def run(suite, model):
    for case in suite:
        out = model(case.prompt)
        score = case.grade(out)        # str | bool | float
        record(case.id, out, score)    # append-only, never mutate
    return summarize()
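
The loop leans on two helpers it never defines. Here is a minimal sketch of the first, assuming an append-only JSONL log; the filename and field names are illustrative, not part of the loop.

record.py python
import json
import time

LOG_PATH = "runs.jsonl"  # illustrative; any append-only sink works

def record(case_id, out, score):
    # one JSON object per line, appended and never rewritten,
    # so every past run stays exactly as it was scored
    entry = {"case": case_id, "output": out, "score": score, "ts": time.time()}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

Append-only matters more than the format: the moment you can mutate a past score, you can no longer trust a delta.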

§ 02 Three graders, no more

You only need three kinds of grader, and you should resist adding a fourth. Exact-match for things you can pin down. Rubric-LLM for things you cannot. Human for the cases the first two disagree on. Every additional grader type is a maintenance bill.

Grader cost vs. signal (normalized, n = 240 cases)

  Exact-match     0.22
  Rubric-LLM      0.64
  Human review    0.92
  Hybrid stack    0.88
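
For concreteness, here is one shape the three graders can take as case.grade implementations. The ask_llm and queue_for_review hooks are placeholders for whatever model client and review queue you already have, not a fixed API.

graders.py python
def exact_match(expected):
    # grader 1: for things you can pin down
    def grade(out):
        return out.strip() == expected.strip()
    return grade

def rubric_llm(rubric, ask_llm):
    # grader 2: ask_llm(prompt) -> str is any model client you have around
    def grade(out):
        verdict = ask_llm(f"Rubric: {rubric}\nOutput: {out}\nScore 0 to 1:")
        # assumes the model answers with a bare number; guard this in real code
        return float(verdict.strip())
    return grade

def escalate(first, second, queue_for_review, tolerance=0.5):
    # grader 3: a human sees the case only when the first two disagree
    def grade(out):
        a, b = first(out), second(out)
        if abs(float(a) - float(b)) <= tolerance:
            return a
        return queue_for_review(out)  # deferred; scored when a human gets to it
    return grade

Each grader is a closure, so the loop in § 01 can call case.grade(out) without caring which kind it got.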

§ 03 The eval set is a living document

Treat the eval set like product copy, not like a unit test. Cases will go stale. The behavior you want today is not the behavior you wanted six months ago. Delete liberally; rewrite without ceremony.

Eval datasets rot faster than code. The most dangerous file in your repo is the one you wrote during onboarding and never opened again.
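
One way to make deleting and rewriting cheap is to keep the set as plain data rather than code. A sketch, assuming one JSON object per case; cases.jsonl and the field names are my choice.

suite.py python
import json
from dataclasses import dataclass

@dataclass
class Case:
    id: str
    prompt: str
    expected: str

    def grade(self, out):
        # simplest default; swap in a rubric or escalation grader per case
        return out.strip() == self.expected.strip()

def load_suite(path="cases.jsonl"):
    # one case per line: easy to diff, easy to delete, no ceremony to rewrite
    with open(path) as f:
        return [Case(**json.loads(line)) for line in f if line.strip()]

A one-line diff per case means stale cases show up in code review instead of hiding in a fixture nobody opens.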

§ 04 The shape of a passing run

A run should fit in your head. If you cannot scan the entire summary in one breath, the harness has grown a second product around it, and you are now maintaining that.

  Run    Cases   Pass          Δ vs prev   p95 latency
  #142   240     218 (90.8%)   +1.7%       4.2s
  #141   240     214 (89.2%)   −0.4%       4.1s
  #140   236     211 (89.4%)   +2.1%       3.9s
  #139   236     206 (87.3%)               4.0s
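
A summarize that keeps to those five columns can read the log back and print one line per run. This sketch assumes two fields the § 01 loop does not write, a run id and a per-case latency; record would have to be extended to capture both.

summary.py python
import json

def summarize(log_path="runs.jsonl"):
    # group the append-only log by run id
    runs = {}
    with open(log_path) as f:
        for line in f:
            e = json.loads(line)
            # "run" and "latency" are assumed extras on top of what record wrote
            runs.setdefault(e["run"], []).append(e)

    prev = None
    for run_id in sorted(runs):
        entries = runs[run_id]
        passed = sum(1 for e in entries if e["score"])  # truthy score counts as a pass
        rate = passed / len(entries)
        delta = "" if prev is None else f"{(rate - prev) * 100:+.1f}%"
        lats = sorted(e["latency"] for e in entries)
        p95 = lats[min(int(len(lats) * 0.95), len(lats) - 1)]
        print(f"#{run_id}  {len(entries)}  {passed} ({rate:.1%})  {delta:>6}  {p95:.1f}s")
        prev = rate

If the output of summarize ever stops fitting in one breath, that is the signal to cut, not to paginate.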

That is the entire harness. Anything else you build on top of this should justify its existence against a single question: does it help me decide whether the last change was good?