VentureBeat

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

A new benchmark, Agents' Last Exam (ALE), has launched to assess how well AI agents perform economically valuable, long-horizon professional tasks. In a surprise result, OpenAI's GPT-5.5 took the top spot with a 24.0% pass rate, surpassing Anthropic's Claude Fable 5.

ALE differs from previous benchmarks by evaluating agents on realistic workflows across five functional layers: reasoning, perception, orchestration, tool invocation, and runtime substrate. Agents must navigate virtual machines using both terminal commands and graphical interfaces, and more than 90% of the grading is deterministic and code-based. The tasks are sourced from real professional histories and cover 55 industry sub-domains, including software development, 3D modeling, and data analysis.

Current top AI models are reportedly failing these authentic, long-horizon workflows, with pass rates on the hardest tier as low as 0.0% for some advanced configurations.

ALE combats benchmark contamination by keeping over 90% of its evaluation data private and releasing tasks incrementally. It also offers "Full" and "Unlicensed" leaderboards to distinguish performance with and without access to proprietary software. The benchmark's rigorous grading provides a reality check for the AI industry: even leading models have significant room for improvement before they are ready for the professional workforce.