I gave seven coding agents the same task: extract the domain layer from a 12,000-line TypeScript codebase. Same prompt, two attempts each, identical environment. Then I read every diff.
§ 01 The setup
The codebase was a real one — a side project, not a benchmark — with the usual tangle of business logic stuffed into route handlers. The brief was deliberately under-specified, the way real refactor tickets are.