SWE-Bench Pro
4 mentions across all digests
SWE-Bench Pro is a contamination-resistant software engineering benchmark evaluating AI models on multi-language coding tasks, on which GLM-5.1, GPT-5.3-Codex, and GPT-5.4 mini have each claimed state-of-the-art results.
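For context on what such a claim means mechanically: SWE-bench-style benchmarks count a task as resolved only when the model's patch makes the designated failing tests pass without breaking the ones that already passed. A minimal sketch of that scoring logic, using illustrative names (`Task`, `score`) rather than the actual SWE-Bench Pro harness API:

```python
# Sketch of SWE-bench-style scoring: each task ships a repo snapshot and
# hidden tests; a task counts as "resolved" only if the fail-to-pass tests
# now pass and the pass-to-pass tests still pass after the model's patch.
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    fail_to_pass: list[str]   # tests that must flip from failing to passing
    pass_to_pass: list[str]   # tests that must keep passing

def score(results: dict[str, dict[str, bool]], tasks: list[Task]) -> float:
    """Fraction of tasks where every required test passes post-patch."""
    resolved = 0
    for task in tasks:
        outcomes = results.get(task.task_id, {})
        required = task.fail_to_pass + task.pass_to_pass
        if required and all(outcomes.get(t, False) for t in required):
            resolved += 1
    return resolved / len(tasks) if tasks else 0.0

tasks = [Task("repo__issue-1", ["test_fix"], ["test_existing"])]
results = {"repo__issue-1": {"test_fix": True, "test_existing": True}}
print(f"resolved rate: {score(results, tasks):.0%}")  # 100%
```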
[AINews] Anthropic Claude Opus 4.7 - literally one step better than 4.6 in every dimension
Anthropic's Claude Opus 4.7 claims #1 benchmark rankings, triples vision input resolution (to 2,576px), and delivers up to 50% token efficiency gains via a new tokenizer, alongside a new xhigh reasoning effort level.
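Token efficiency claims of this kind are straightforward to sanity-check: encode the same text with two tokenizers and compare counts. A rough sketch using public tiktoken vocabularies purely as stand-ins, since nothing here assumes access to Anthropic's new tokenizer:

```python
# Measuring a "more efficient tokenizer" claim: encode identical text with
# an old and a new vocabulary and report the fractional token reduction.
# cl100k_base and o200k_base are public tiktoken encodings used only for
# illustration; they are not Anthropic tokenizers.
import tiktoken

def token_savings(text: str, old: str = "cl100k_base", new: str = "o200k_base") -> float:
    """Fractional reduction in token count going from `old` to `new`."""
    old_tokens = len(tiktoken.get_encoding(old).encode(text))
    new_tokens = len(tiktoken.get_encoding(new).encode(text))
    return 1 - new_tokens / old_tokens

sample = "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)"
print(f"token savings: {token_savings(sample):.1%}")
```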
Introducing GPT-5.3-Codex
AI joins the 8-hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT 5.4 on SWE-Bench Pro
Zhipu AI's open-source GLM 5.1 outperforms Claude Opus 4.6 and GPT 5.4 on SWE-Bench Pro, signaling that open-source models are closing the gap with frontier closed models on software engineering benchmarks.
Introducing GPT-5.4 mini and nano
OpenAI releases GPT-5.4 mini and nano, variants that achieve a 2x speed improvement over GPT-5 mini while maintaining near-equivalent performance, targeting cost-sensitive agentic and real-time applications.
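Cost-sensitive agentic stacks typically exploit model tiers like these by routing requests: cheap, latency-bound calls go to the smallest model, and harder tasks escalate to a larger one. A hypothetical routing sketch; the thresholds and tier assignments are illustrative, not OpenAI guidance:

```python
# Hypothetical model router: pick the cheapest tier that plausibly fits
# the request's latency budget and complexity. Model names come from the
# digest above; the cutoffs are invented for illustration.
def pick_model(prompt: str, latency_budget_ms: int) -> str:
    if latency_budget_ms < 500 or len(prompt) < 200:
        return "gpt-5.4-nano"   # fastest, cheapest tier for real-time calls
    if len(prompt) < 4000:
        return "gpt-5.4-mini"   # near-flagship quality at lower cost
    return "gpt-5.3-codex"      # full model for heavy coding tasks

print(pick_model("Rename this variable across the file.", latency_budget_ms=300))
# -> gpt-5.4-nano
```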