StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Monday, April 6, 2026 12:00 PM UTC · 2 MIN READ · SOURCE: arXiv CS.CL (Computation & Language) · BY sys://pipeline

The StructEval benchmark exposes critical gaps in LLMs' structured output generation: even o1-mini achieves only 75.58% average accuracy across 18 formats, and visual content generation consistently fails.

StructEval is a comprehensive benchmark for evaluating LLMs' structured output generation across 18 formats (JSON, YAML, React, SVG, etc.) and 44 task types. The results show significant capability gaps: o1-mini leads with only a 75.58% average score, open-source models lag roughly 10 points behind, and visual content generation is particularly weak across all models.
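The summary does not spell out how StructEval scores outputs, so the following is a minimal sketch of the kind of check "structured output generation" implies: parse each model output in its target format and count the fraction that succeed. Everything below (the validator functions, the scoring rule, the sample outputs) is an illustrative assumption, not the paper's actual pipeline.

```python
# Sketch of a per-format validity check (assumed, not StructEval's real scorer).
import json
import xml.etree.ElementTree as ET

import yaml  # third-party: pip install pyyaml


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def is_valid_yaml(text: str) -> bool:
    # Note: YAML accepts most plain text as a scalar, so a real benchmark
    # would need stricter, task-specific checks than bare parseability.
    try:
        yaml.safe_load(text)
        return True
    except yaml.YAMLError:
        return False


def is_valid_svg(text: str) -> bool:
    # SVG is XML; well-formedness plus an <svg> root is a cheap proxy
    # for the renderability checks a visual-format task would need.
    try:
        return ET.fromstring(text).tag.endswith("svg")
    except ET.ParseError:
        return False


VALIDATORS = {"json": is_valid_json, "yaml": is_valid_yaml, "svg": is_valid_svg}


def validity_score(outputs: dict[str, str]) -> float:
    """Fraction of (format, output) pairs that parse in their target format."""
    checks = [VALIDATORS[fmt](out) for fmt, out in outputs.items() if fmt in VALIDATORS]
    return sum(checks) / len(checks) if checks else 0.0


if __name__ == "__main__":
    samples = {
        "json": '{"name": "StructEval", "formats": 18}',
        "yaml": "name: StructEval\nformats: 18\n",
        "svg": '<svg xmlns="http://www.w3.org/2000/svg"><rect width="10" height="10"/></svg>',
    }
    print(f"validity: {validity_score(samples):.2%}")
```

Bare parseability is only a floor: the summary's observation that visual formats fail consistently across models suggests the benchmark's scoring goes beyond whether the output parses.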
Tags: models