One Org or Many? The Postmortem Nobody Wants to Write

A financial institution consolidated its business units under a single AWS Organization for cost governance, but a mistake in applying a Service Control Policy (SCP) to the root node caused a 47-minute production outage. The pressure for consolidated cost visibility, single policy enforcement, and simplified automation led to this architectural decision. A security team's attempt to restrict AWS regions for compliance, by applying an SCP at the root, inadvertently blocked critical DynamoDB calls for the payments pipeline. This occurred because the payments account used a DynamoDB Global Tables endpoint in a blocked region, and the error went undetected by initial monitoring. The incident timeline shows a gradual failure, from SDK errors to circuit breaker trips and finally an alert that initially misdirected troubleshooting efforts. The root cause was identified as an unrestricted blast radius by design, stemming from the single-Organization architecture, not just the human error. This design lacked isolation between regulated workloads like payments and operational workloads. A single Organization with hierarchical SCPs means any root-level SCP affects all accounts simultaneously with no native progressive rollback. The text contrasts this problematic single-Org pattern with a remediated multi-Org approach that provides stronger isolation boundaries. Multiple Organizations offer independent policy lifecycles, separate management account credentials, and cleaner regulatory isolation, which auditors accept more readily. While multiple Organizations increase operational complexity and landing zone automation duplication, these are addressable issues. The immediate technical remediation involved implementing a guardrail SCP to prevent attaching policies to the root or critical OUs. In the medium term, a second AWS Organization was created for PCI-DSS workloads, enhancing security and policy independence. Observability was improved by capturing AWS Organizations events to Slack and PagerDuty, drastically reducing mean time to detection. The core problem with SCPs is their immediate propagation and lack of staging or automatic rollback, unlike typical infrastructure code. To address this, an immutable state registry for SCPs was implemented, enabling automated rollbacks within seconds.

https://dev.to/fernando_azevedo_6844e930/one-org-or-many-the-postmortem-nobody-wants-to-write-5e50 dev.to

RSS Hunter • Yesterday