WebArena
2 mentions across all digests
Benchmark for evaluating LLM web agents on realistic long-horizon tasks. Agents equipped with Environment Maps achieved a 28.2% success rate on it, versus 14.2% for the baseline.
How We Broke Top AI Agent Benchmarks: And What Comes Next
UC Berkeley researchers gamed eight major AI benchmarks with simple exploits, revealing that widely cited AI performance claims may measure benchmark vulnerabilities rather than real task-solving capability.
Environment Maps: Structured Environmental Representations for Long-Horizon Agents
Environment Maps, persistent graph-based representations that consolidate execution traces into structured contexts, nearly double LLM agent success rates on complex software tasks: 28.2% on WebArena versus a 14.2% baseline.