Most teams I talk to are running their evals on instinct. They have a chat window, a hunch, and twelve tabs of comparative outputs. This works — until it doesn't. Around the fiftieth prompt, the room gets very quiet, because nobody can remember whether the last change made things better or just different.
§ 01 Why vibes fail past 50 prompts
There is no shame in vibes. Every eval starts as a vibe. The problem is that human attention does not scale linearly with output volume — it decays. By prompt forty, you are not evaluating; you are skimming.
A modest harness fixes this not by being clever, but by being patient. It looks at every output the same way, every time. That is its whole job.
```python
# the smallest useful eval loop
def run(suite, model):
    for case in suite:
        out = model(case.prompt)
        score = case.grade(out)      # str | bool | float
        record(case.id, out, score)  # append-only, never mutate
    return summarize()
```
§ 02 Three graders, no more
You only need three kinds of grader, and you should resist adding a fourth. Exact-match for things you can pin down. Rubric-LLM for things you cannot. Human for the cases the first two disagree on. Every additional grader type is a maintenance bill.
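The three grader kinds can be sketched as plain functions. This is a minimal illustration under assumed names and signatures (`exact_match`, `rubric_llm`, `needs_human`, and the `judge` callable are all hypothetical, not from any particular library):

```python
# Sketch of the three grader kinds. All names here are illustrative.

def exact_match(expected):
    """Grader for answers you can pin down verbatim."""
    def grade(output: str) -> bool:
        return output.strip() == expected.strip()
    return grade

def rubric_llm(rubric, judge):
    """Grader that asks a judge model to score output against a rubric.
    `judge` is any callable that takes a prompt and returns a numeric string."""
    def grade(output: str) -> float:
        verdict = judge(f"Rubric: {rubric}\nOutput: {output}\nScore 0 to 1:")
        return float(verdict.strip())
    return grade

def needs_human(exact, rubric, threshold=0.5):
    """Escalate to a person only when the first two graders disagree."""
    def grade(output: str):
        a = exact(output)
        b = rubric(output) >= threshold
        if a != b:
            return "HUMAN_REVIEW"
        return a
    return grade
```

The escalation grader is the important one: it keeps humans out of the loop for everything the cheap graders agree on, which is most cases.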
§ 03 The eval set is a living document
Treat the eval set like product copy, not like a unit test. Cases will go stale. The behavior you want today is not the behavior you wanted six months ago. Delete liberally; rewrite without ceremony.
Eval datasets rot faster than code. The most dangerous file in your repo is the one you wrote during onboarding and never opened again.
§ 04 The shape of a passing run
A run should fit in your head. If you cannot scan the entire summary in one breath, the harness has grown a second product around it, and you are now maintaining that.
| Run | Cases | Pass | Δ vs prev | p95 latency |
|---|---|---|---|---|
| #142 | 240 | 218 (90.8%) | +1.7% | 4.2s |
| #141 | 240 | 214 (89.2%) | −0.2% | 4.1s |
| #140 | 236 | 211 (89.4%) | +2.1% | 3.9s |
| #139 | 236 | 206 (87.3%) | — | 4.0s |
That is the entire harness. Anything else you build on top of this should justify its existence against a single question: does it help me decide whether the last change was good?
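A summary like the table above can come out of a `summarize()` as small as the loop that feeds it. This is a sketch under one assumption: that `record()` appended dicts with a boolean `score` and a `latency` in seconds to the list passed in here (the field names are illustrative).

```python
import math

def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def summarize(results, prev_pass_rate=None):
    """Reduce a run to the numbers in the summary table:
    case count, pass count/rate, delta vs the previous run, p95 latency."""
    passed = sum(1 for r in results if r["score"])
    rate = passed / len(results)
    delta = None if prev_pass_rate is None else rate - prev_pass_rate
    return {
        "cases": len(results),
        "pass": passed,
        "pass_rate": round(rate, 3),
        "delta_vs_prev": delta,
        "p95_latency": p95([r["latency"] for r in results]),
    }
```

One dict per run, appended to a log, is enough to regenerate the whole table.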