A field guide to writing your own eval harness
Why "vibes-based" testing collapses past 50 prompts, and the smallest harness that scales without becoming a second product.
Why "vibes-based" testing collapses past 50 prompts, and the smallest harness that scales without becoming a second product.
"The taxonomy I keep coming back to. Workflows vs. agents, with worked patterns."
via Anthropic"A pragmatic field report. The section on guardrail budgets is worth the read alone."
via MediumSame 12k-line TypeScript codebase, same task: extract a domain layer. I ran every agent twice and graded the diffs.
"Skim the intro, read the middle. Their framing of "negotiated autonomy" is sticky."
via The New StackMost MCP servers I see are CRUD wrappers. Here is what changes when you design tools as if a model were a junior engineer with no memory.
A 90th-percentile breakdown of where the seconds actually go in a tool-using chat. Spoiler: it is not the model.
When the user is not the only one reading your UI. A short manifesto on machine-legible interfaces.