Anthropic's Tristan Hume details how successive Claude models (Opus 4, then Opus 4.5) broke their performance engineering take-home test, forcing three redesigns. The post outlines what makes technical evaluations robust to AI — longer time horizons, multi-faceted problems with no single insight gate, and tasks requiring real system comprehension. Anthropic is releasing the original challenge publicly since humans with unlimited time can still outperform Opus 4.5.
Designing AI-resistant technical evaluations
Claude Opus 4 and 4.5 successively defeated Anthropic's 'AI-resistant' hiring evaluation, revealing that truly robust technical assessments require multi-faceted problems demanding deep system comprehension rather than just extended time limits.
Saturday, April 4, 2026, 12:00 PM UTC · 2 min read · Source: Anthropic Engineering Blog · By: sys://pipeline
Tags: models