Debugging the One-in-a-Million Failure: Migrating Pinterest’s Search Infrastructure to Kubernetes

Pinterest migrated its search infrastructure, Manas, to Kubernetes and then discovered a performance problem: one in every million search requests took 100 times longer than usual.

The investigation drew on search-system profiling, performance debugging, Linux kernel features, and memory management. It traced the latency spikes to a monitoring process, cAdvisor. cAdvisor's working set size (WSS) estimation, which is enabled by default, scanned the entire page table every 30 seconds to calculate the total bytes of memory referenced by a process. That scan contended with the memory-intensive leaf processing in Manas.

Disabling cAdvisor's WSS estimation on all PinCompute nodes resolved the issue. The fix was a major milestone for Pinterest's Kubernetes platform because it allowed other online services to move onto the platform. The investigation highlighted the importance of resource isolation, of narrowing the problem space, and of blackbox debugging strategies. It also showed that a good-enough solution can be sufficient: finding an exact solution is not always necessary to move forward.
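To make the "total bytes of memory referenced by a process" measurement concrete, here is a minimal Python sketch. It assumes the figure is read from the `Referenced:` fields of `/proc/<pid>/smaps`, which is one interface Linux provides for this; the summary above does not say which interface cAdvisor actually uses. The function names and the sample data are hypothetical, for illustration only.

```python
from typing import Iterable


def referenced_kb(smaps_lines: Iterable[str]) -> int:
    """Sum the 'Referenced:' fields (in kB) across all mappings in smaps text."""
    total = 0
    for line in smaps_lines:
        if line.startswith("Referenced:"):
            # Line format: "Referenced:           40 kB"
            total += int(line.split()[1])
    return total


def referenced_kb_for_pid(pid: int) -> int:
    """Read /proc/<pid>/smaps (Linux only).

    The kernel walks the process's page table to produce this file, so the
    cost of each read grows with the size of the process's mapped memory.
    """
    with open(f"/proc/{pid}/smaps") as f:
        return referenced_kb(f)


# Hypothetical two-mapping excerpt in smaps format, used so the sketch
# runs without access to a real process.
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 123 /bin/cat
Size:                 44 kB
Rss:                  40 kB
Referenced:           40 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Size:                132 kB
Rss:                  12 kB
Referenced:            8 kB
"""

if __name__ == "__main__":
    print(referenced_kb(SAMPLE.splitlines()))  # 48
```

For a process with a very large memory footprint, such as a search leaf, repeating a scan like this on a fixed interval is the kind of periodic work that can contend with request handling.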