VentureBeat
When Claude changed, everything changed: Managing AI blast radius in production
The system in question translated natural-language queries into API calls, sparing analysts and account managers the work of assembling data from various sources. It dispatched API calls to integrated backends, shaped the responses with an LLM-generated JSON query, and delivered results via email, Drive documents, or browser charts. By mid-2025, it had become the standard method for ad-hoc data retrieval, generating several hundred reports monthly for internal and external stakeholders.
The core interaction relied on a contract between the LLM and the system: a structured JSON object. Initial model upgrades from Claude Sonnet 3.5 to 4.0 were seamless, which fostered complacency about LLM stability. The Sonnet 4.5 upgrade, however, caused two major issues. First, the model began embedding post_body content in the description field. That left the filter parameters for API calls empty, so calls either retrieved far too much data or failed with 500 errors. Second, Sonnet 4.5 started asking clarifying questions. The system was designed for direct API calls, with no human interaction or state management, and had no path for handling them.
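Both failure modes can in principle be detected at the contract boundary before any API call is dispatched. The following is a minimal sketch, not the authors' implementation. The field names `description` and `post_body` come from the incident description, but the exact layout of the JSON object, the helper names `looks_like_payload` and `check_contract`, and the brace-scanning heuristic are all assumptions made for illustration.

```python
import json
from typing import Any

_DECODER = json.JSONDecoder()


def looks_like_payload(text: str) -> bool:
    """Heuristic (assumed): True if text embeds a non-empty serialized JSON object."""
    for i, ch in enumerate(text):
        if ch == "{":
            try:
                value, _ = _DECODER.raw_decode(text, i)
            except json.JSONDecodeError:
                continue  # this brace did not start valid JSON; keep scanning
            if isinstance(value, dict) and value:
                return True
    return False


def check_contract(raw_reply: str) -> list[str]:
    """Return contract violations for one model reply; an empty list means it passes."""
    try:
        reply: Any = json.loads(raw_reply)
    except json.JSONDecodeError:
        # Plain prose, such as a clarifying question, is not valid JSON.
        return ["reply is not JSON (possibly a clarifying question)"]
    if not isinstance(reply, dict):
        return ["reply is not a JSON object"]

    problems: list[str] = []
    description = reply.get("description", "")
    if not isinstance(description, str):
        problems.append("description is not a string")
    elif looks_like_payload(description):
        problems.append("description contains serialized payload content")

    # Assumption: post_body carries the filter parameters for the API call.
    if not reply.get("post_body"):
        problems.append("post_body is missing or empty; the call would be unfiltered")
    return problems
```

A caller could refuse to dispatch whenever `check_contract` returns a non-empty list, which turns a silent over-broad query into an explicit, loggable rejection.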
These failures forced a rollback to Sonnet 4.0, which was complicated by new API integrations that had been qualified against 4.5. The incident shows how LLM-backed systems defy traditional engineering discipline: the model is an internal component outside the developers' control, so the blast radius of a change is unpredictable and effectively "infinite." The post-mortem revealed an under-specified prompt. Previous model versions had implicitly inferred constraints that Sonnet 4.5, being more "helpful," violated.
The authors propose an "evals-first" architecture in which an evaluation suite, rather than the prompt, serves as the formal system specification. Each eval consists of an input, the required properties of the output, and a scoring function, and the suite is used to validate model or prompt changes. An example eval would check whether the description field contains serialized payload content. Although evals are expensive to build and maintain, they act as a gate, bounding the blast radius by densely sampling input-output behavior.
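The three parts of an eval (input, required output properties, scoring function) and the gating step can be sketched as follows. This is an illustrative assumption about how such a suite might be structured, not the authors' code: `EvalCase`, `score`, `gate`, the two property functions, and the regex heuristic are hypothetical, and `Model` stands in for a real LLM call.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable

Property = Callable[[dict], bool]
Model = Callable[[str], str]  # stand-in for a real LLM call: query -> raw reply


def description_is_plain_text(reply: dict) -> bool:
    """Heuristic (assumed): fail if description appears to open a serialized JSON object."""
    description = str(reply.get("description", ""))
    return re.search(r'\{\s*"', description) is None


def filters_present(reply: dict) -> bool:
    """Assumption: post_body must carry at least one filter parameter."""
    return bool(reply.get("post_body"))


@dataclass(frozen=True)
class EvalCase:
    name: str
    query: str                          # the input
    properties: tuple[Property, ...]    # the required output properties


def score(case: EvalCase, raw_reply: str) -> float:
    """Scoring function: fraction of properties satisfied, 0.0 if not a JSON object."""
    try:
        reply = json.loads(raw_reply)
    except json.JSONDecodeError:
        return 0.0  # e.g. the model asked a clarifying question instead
    if not isinstance(reply, dict):
        return 0.0
    if not case.properties:
        return 1.0  # nothing required beyond being a JSON object
    passed = sum(1 for prop in case.properties if prop(reply))
    return passed / len(case.properties)


def gate(cases: list[EvalCase], model: Model, threshold: float = 1.0) -> bool:
    """Allow a model or prompt change only if every case scores at or above threshold."""
    return all(score(case, model(case.query)) >= threshold for case in cases)
```

In this sketch, a candidate model or prompt would be promoted only when `gate` returns `True` for the whole suite. Each production incident then becomes a new `EvalCase`, which is how the suite gradually takes over the role of specification.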
Despite their utility, evals are not a panacea: they catch only the failure modes that have been specified, and LLM-as-judge scoring introduces its own variance. The engineering community still lacks both standards for eval coverage in natural language and CI/CD systems that handle probabilistic test outcomes. Closing the gap between passing smoke tests and predicting production behavior, especially as agents become more autonomous, is a critical engineering challenge. Teams that treat evals as the system's true specification will be best equipped to meet it.
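One possible way to gate on a probabilistic outcome, offered here only as an illustration and not as something the authors prescribe, is to run an eval many times and compare a lower confidence bound on the pass rate against a required rate, rather than demanding that a single run pass. The sketch below uses the Wilson score interval; the function names, the 95% required rate, and the z value of 1.96 are assumptions.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for an observed pass rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - margin


def passes_gate(passes: int, trials: int, required_rate: float = 0.95, z: float = 1.96) -> bool:
    """Pass only if we are reasonably confident the true pass rate meets the bar."""
    return wilson_lower_bound(passes, trials, z) >= required_rate
```

Under these assumptions, 20 passes out of 20 runs does not clear a 95% bar, because the sample is too small to rule out a lower true pass rate, while 100 out of 100 does. The trade-off is cost: tighter bounds require more model calls per eval.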