Benchmarking 7 coding agents on a real refactor
Same 12k-line TypeScript codebase, same task: extract a domain layer. I ran every agent twice and graded the diffs.
APR 22, 2026 ·
Same 12k-line TypeScript codebase, same task: extract a domain layer. I ran every agent twice and graded the diffs.
A 90th-percentile breakdown of where the seconds actually go in a tool-using chat. Spoiler: it is not the model.