The author initially relied on CPU-based scaling for their Auto Scaling Group, a common practice that often works well for web servers. Background jobs, particularly those run by Sidekiq, exposed a significant flaw in this approach: critical background tasks were failing and worker processes were crashing unexpectedly, even though CPU utilization was low. The root cause was the Linux Out-of-Memory (OOM) killer, which was terminating processes when the instance ran out of memory. The default cloud monitoring did not track memory usage, so nothing flagged the problem and the CPU-based policy never triggered a scale-out. The fix was to install the CloudWatch Agent to collect and publish memory usage metrics, and then to use those metrics in a memory-based scaling policy for the Auto Scaling Group. With this change the group scaled out proactively under memory pressure, adding instances before workers were killed. As a result, the crashes stopped immediately, improving system stability and efficiency. The author's closing point is that background workers are often memory-bound and need memory-based scaling, whereas web servers are usually served well by CPU-based scaling.
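The two steps of the fix can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's actual configuration: the group name `sidekiq-workers`, the policy name, the 70% target, and the helper function names are all hypothetical. The metric name `mem_used_percent` and the `CWAgent` namespace are the CloudWatch Agent's defaults for Linux memory metrics. The first function builds an agent configuration that publishes memory usage aggregated per Auto Scaling Group. The second builds the arguments for a target tracking policy on that metric.

```python
import json


def memory_agent_config(interval_seconds: int = 60) -> dict:
    """Build a CloudWatch Agent config that reports memory usage per ASG."""
    return {
        "metrics": {
            "namespace": "CWAgent",  # the agent's default namespace
            # Tag each data point with the instance's Auto Scaling Group.
            "append_dimensions": {
                "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
            },
            # Also publish a roll-up keyed only by the group name, so a
            # scaling policy can track the group as a whole.
            "aggregation_dimensions": [["AutoScalingGroupName"]],
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                }
            },
        }
    }


def memory_scaling_policy(asg_name: str, target_percent: float = 70.0) -> dict:
    """Build put_scaling_policy arguments for memory-based target tracking.

    The 70% default is an assumed value, not one taken from the article.
    """
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "memory-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "CustomizedMetricSpecification": {
                "MetricName": "mem_used_percent",
                "Namespace": "CWAgent",
                "Dimensions": [
                    {"Name": "AutoScalingGroupName", "Value": asg_name}
                ],
                "Statistic": "Average",
            },
            "TargetValue": target_percent,
        },
    }


def apply_policy(policy: dict) -> None:
    """Send the policy to AWS. Needs boto3 and AWS credentials."""
    import boto3  # third-party; imported lazily so the builders run without it

    boto3.client("autoscaling").put_scaling_policy(**policy)


if __name__ == "__main__":
    # Print the agent config as JSON, ready to save as the agent's config file.
    print(json.dumps(memory_agent_config(), indent=2))
```

Calling `apply_policy(memory_scaling_policy("sidekiq-workers"))` would attach the policy to the hypothetical group. A target below 100% leaves headroom so that new instances can come online before the OOM killer steps in. The right value depends on how quickly the workers' memory grows and how long an instance takes to boot.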
dev.to
