VentureBeat

New AI optimization framework beats Claude Code and Codex by 2.5x on the same compute budget

AI agents designed for complex tasks such as searching internal documents often hallucinate or miss crucial constraints in production. Fixing these failures takes tedious trial and error, and it is hard to pinpoint what actually caused any improvement. Arbor, a new framework from Renmin University of China and Microsoft Research, turns this into a cumulative learning process. It organizes hypotheses, experiments, and insights into a tree structure so that the system can learn from past failures. In practical tests, Arbor achieved more than 2.5 times the verifiable performance gains of standard AI coding agents.

Autonomous Optimization (AO) is the fundamental loop of AI research: iteratively improving an artifact based on experimental feedback. The primary challenge with AO is that adding compute does not by itself guarantee progress. Current agent systems treat each attempt in isolation and lack mechanisms to accumulate and act on what they have learned. They struggle to maintain and compare multiple research directions at once, which limits their ability to interpret results and reshape future exploration the way human researchers do. General coding agents often lose factual evidence over long histories because of context window limits, which leads to stalled progress or to chasing irrelevant improvements.

Arbor addresses these issues by separating research direction from coding tasks, using a coordinator and executors. The coordinator manages the overall research state, generates hypotheses, and analyzes results. Executors are short-lived agents that each test a single hypothesis in an isolated environment and report back. This collaboration, called Hypothesis Tree Refinement (HTR), structures the research process as a persistent, branching tree of hypotheses, evidence, and insights. To prevent reward hacking, Arbor enforces a strict “merge gate”: an improvement is integrated only after it has been verified against held-out test data.
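The coordinator/executor split and the merge gate can be illustrated with a small Python sketch. This is not Arbor's actual code or API: the names HypothesisNode, ExecutorResult, Coordinator, and passes_merge_gate are hypothetical, the scores are made-up illustrative numbers, and the gate rule used here (merge only if the held-out score beats the best merged score so far) is an assumption about how such a gate might work.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ExecutorResult:
    """What a short-lived executor reports back (hypothetical shape)."""
    dev_score: float      # score on the data the executor could see
    heldout_score: float  # score on held-out data used only by the gate


@dataclass(eq=False)
class HypothesisNode:
    """One node of the persistent hypothesis tree."""
    hypothesis: str
    parent: Optional[HypothesisNode] = field(default=None, repr=False)
    children: list[HypothesisNode] = field(default_factory=list)
    result: Optional[ExecutorResult] = None
    insight: str = ""
    merged: bool = False


def passes_merge_gate(candidate: float, best_merged: float, min_gain: float = 0.0) -> bool:
    """Assumed gate rule: the held-out score must beat the best merged score."""
    return candidate > best_merged + min_gain


class Coordinator:
    """Long-lived owner of the research state; executors are plain callables here."""

    def __init__(self, baseline_heldout: float,
                 run_executor: Callable[[str], ExecutorResult]) -> None:
        self.root = HypothesisNode("baseline", merged=True)
        self.best_heldout = baseline_heldout
        self.run_executor = run_executor

    def test(self, parent: HypothesisNode, hypothesis: str) -> HypothesisNode:
        # Record the hypothesis in the tree before running it, so failed
        # attempts stay available as evidence instead of being forgotten.
        node = HypothesisNode(hypothesis, parent=parent)
        parent.children.append(node)
        node.result = self.run_executor(hypothesis)
        if passes_merge_gate(node.result.heldout_score, self.best_heldout):
            node.merged = True
            self.best_heldout = node.result.heldout_score
            node.insight = "gain verified on held-out data; merged"
        else:
            node.insight = "no held-out gain; kept as evidence, not merged"
        return node


def fake_executor(hypothesis: str) -> ExecutorResult:
    """Stand-in for an isolated executor run; scores are invented for the demo."""
    scores = {
        "overfit prompt to dev set": ExecutorResult(dev_score=0.80, heldout_score=0.58),
        "add constraint checker": ExecutorResult(dev_score=0.70, heldout_score=0.66),
    }
    return scores[hypothesis]


coordinator = Coordinator(baseline_heldout=0.60, run_executor=fake_executor)
for idea in ("overfit prompt to dev set", "add constraint checker"):
    node = coordinator.test(coordinator.root, idea)
    print(f"{node.hypothesis}: merged={node.merged} ({node.insight})")
```

In this sketch the first hypothesis looks better on the data the executor saw but worse on held-out data, so the gate rejects it while its result stays in the tree as evidence. That is the kind of reward hacking the merge gate is described as guarding against.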
While Arbor's output integrates with existing Git workflows, its main costs are token consumption for the long-lived coordinator and compute resources for the isolated worktrees. Arbor excels at tasks with clear metrics and long time horizons, but it is not suited to real-time tasks or to tasks whose evaluation metrics are flawed.