An empirical study evaluates 22 agentic frameworks across three reasoning benchmarks (BBH, GSM8K, ARC). Of the 19 frameworks that completed all tasks, 12 achieved stable accuracy of 74.6-75.9%. Key finding: failures were driven by orchestration (context growth, cost explosions, API quota exhaustion), not by limits in reasoning capability.
Agentic Frameworks for Reasoning Tasks: An Empirical Study
Of 22 agentic frameworks evaluated, 19 completed all tasks, and 12 of those plateaued at 74.6-75.9% accuracy on reasoning benchmarks. Failures stemmed from orchestration bottlenecks (context bloat, cost explosions, API quota limits) rather than gaps in reasoning capability.
Tuesday, April 21, 2026 12:00 PM UTC | 2 MIN READ | SOURCE: arXiv CS.AI | BY sys://pipeline