Finding zombies in our systems: A real-world story of CPU bottlenecks
Pinterest's ML platform team saw Ray-based training jobs crash because of intermittent network connectivity issues, which prompted an investigation by the PinCompute team. The investigation, which spanned more than three months, found that the failures correlated with ENA network driver resets on AWS EC2 instances. The resets were caused by CPU starvation and were linked to high system CPU usage. The team first tried several mitigations, including huge pages and changes to memory allocators, none of which resolved the issue. Oddly, the problem occurred in only one of Pinterest's AWS availability zones. Profiling with perf and mpstat showed individual CPU cores being saturated. A temporal profiling setup built on perf then revealed that the culprit was a process that sporadically consumed large amounts of CPU, and that process was identified as a zombie process. Discovering the zombies and their effect on CPU utilization and network driver performance gave the team a deeper understanding of the system's bottlenecks.
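
The per-core check described above can be approximated without mpstat by sampling /proc/stat twice and comparing the counters. The following is a minimal sketch, not the tooling the Pinterest team used: the function names, the 0.9 threshold, and the choice to count only the "system" column (ignoring irq and softirq) are assumptions made for illustration.

```python
import time
from typing import Dict, List


def parse_proc_stat(text: str) -> Dict[str, List[int]]:
    """Parse /proc/stat text into {"cpu0": [user, nice, system, idle, ...], ...}."""
    counters: Dict[str, List[int]] = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-core lines ("cpu0", "cpu1", ...); skip the aggregate "cpu" line.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            counters[fields[0]] = [int(value) for value in fields[1:]]
    return counters


def system_share(before: Dict[str, List[int]],
                 after: Dict[str, List[int]]) -> Dict[str, float]:
    """Fraction of each core's elapsed time spent in system (kernel) mode."""
    shares: Dict[str, float] = {}
    for core, new in after.items():
        old = before.get(core)
        if old is None:
            continue
        deltas = [n - o for n, o in zip(new, old)]
        # Sum user..steal (first 8 columns); guest time is already counted in user/nice.
        total = sum(deltas[:8])
        if total <= 0:
            continue
        shares[core] = deltas[2] / total  # column index 2 is "system"
    return shares


def saturated_cores(before: Dict[str, List[int]],
                    after: Dict[str, List[int]],
                    threshold: float = 0.9) -> List[str]:
    """Cores whose system-mode share meets or exceeds the (assumed) threshold."""
    shares = system_share(before, after)
    return sorted(core for core, share in shares.items() if share >= threshold)


def sample_once(interval_seconds: float = 1.0) -> List[str]:
    """Take two live snapshots on a Linux host and report saturated cores."""
    with open("/proc/stat") as handle:
        first = parse_proc_stat(handle.read())
    time.sleep(interval_seconds)
    with open("/proc/stat") as handle:
        second = parse_proc_stat(handle.read())
    return saturated_cores(first, second)
```

Calling sample_once() in a loop on a Linux host would flag cores that spend nearly all of an interval in kernel mode. Working per core matters here because a single saturated core can be hidden by a low host-wide average.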