DEV Community
Follow
Postmortem: The 2026 Slack Outage Due to Istio 1.22 Circuit Breaker Misconfiguration
Slack experienced a widespread outage on March 12, 2026, lasting over two hours, affecting 18.2 million users. The root cause was a misconfigured circuit breaker in Istio 1.22 used by the chat-api service. The faulty configuration caused all chat-api pods to be ejected after a single transient database connection error. This resulted in the failure of core messaging features, including sending messages and file sharing. The incident timeline involved the initial error, then a rapid escalation and eventual service failure. The remediation involved rolling back the faulty Istio configuration to a stable version. Slack's monthly uptime SLA was impacted, resulting in significant credits for enterprise customers. To prevent future incidents, Slack implemented new pre-deployment validation, alerts, and enhanced testing. They also published improved troubleshooting runbooks and adopted a canary rollout approach for future changes. A comprehensive post-incident review was conducted, sharing learnings across all engineering teams. Slack is committed to improving service mesh practices to minimize downtime for all users.