#agents

A field guide to writing your own eval harness

Why "vibes-based" testing collapses past 50 prompts, and the smallest harness that scales without becoming a second product.

"The taxonomy I keep coming back to. Workflows vs. agents, with worked patterns."

"A pragmatic field report. The section on guardrail budgets is worth the read alone."

Same 12k-line TypeScript codebase, same task: extract a domain layer. I ran every agent twice and graded the diffs.

"Skim the intro, read the middle. Their framing of "negotiated autonomy" is sticky."

Most MCP servers I see are CRUD wrappers. Here is what changes when you design tools as if a model were a junior engineer with no memory.

A 90th-percentile breakdown of where the seconds actually go in a tool-using chat. Spoiler: it is not the model.

When the user is not the only one reading your UI. A short manifesto on machine-legible interfaces.