Benchmarking 7 coding agents on a real refactor

Same 12k-line TypeScript codebase, same task: extract a domain layer. I ran every agent twice and graded the diffs.

APR 22, 2026 · benchmarks, agents, refactoring, data