A field guide to writing your own eval harness
Why "vibes-based" testing collapses past 50 prompts, and the smallest harness that scales without becoming a second product.
APR 28, 2026
"Annual roundup. Section on tool-calling reliability is gold."
via simonwillison.net
Most MCP servers I see are CRUD wrappers. Here is what changes when you design tools as if a model were a junior engineer with no memory.
"The PageRank-on-symbols trick is more interesting than the headline feature."
via aider.chat