
Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest

Pinterest runs Apache Spark for large-scale data processing and faced frequent out-of-memory errors (OOMs). To address them, the team introduced "Auto Memory Retries", a feature that automatically retries tasks that fail with an OOM on executors with more memory. The primary goals were to reduce on-call load and to cut the cost of failing applications. The core idea is to assign tasks with higher memory needs to a resource profile that provides more memory. Pinterest's custom Apache Spark build modifies the scheduling loop to retry such tasks under larger memory profiles using a hybrid approach: it can increase the CPUs per task, so that fewer tasks run concurrently and share an executor's memory, or launch physically larger executors if necessary. The implementation extended core Spark classes such as Task and TaskSetManager and updated the Spark UI. The team built a dashboard to monitor the impact, tracking cost savings and job recovery rates. The rollout was staged, starting with ad-hoc jobs and then gradually adding scheduled jobs in tiers. The feature reduced OOM errors and improved resource utilization within the Spark cluster.
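
The hybrid retry policy described above can be illustrated with a small, self-contained sketch. This is not Pinterest's implementation and does not use Spark's API: the `ResourceProfile` class, the `next_profile` function, the doubling factor, and the `max_executor_memory_gb` cap are all hypothetical names and assumptions chosen for illustration. The sketch only shows the ordering of the two steps: first give the task more CPUs on an executor of the same size, and only then ask for a larger executor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceProfile:
    """Hypothetical stand-in for a resource profile (not Spark's class)."""
    executor_cores: int
    executor_memory_gb: int
    task_cpus: int

    @property
    def memory_per_task_gb(self) -> float:
        # An executor runs executor_cores // task_cpus tasks at once, so a
        # task's share of executor memory grows as task_cpus grows.
        concurrent_tasks = self.executor_cores // self.task_cpus
        return self.executor_memory_gb / concurrent_tasks


def next_profile(current: ResourceProfile,
                 max_executor_memory_gb: int) -> Optional[ResourceProfile]:
    """Pick a profile for retrying a task that failed with an OOM.

    Returns None when no larger profile is available under the cap.
    The doubling factor is an assumption made for this sketch.
    """
    # Step 1: claim more CPUs per task on an executor of the same size.
    doubled_cpus = current.task_cpus * 2
    if doubled_cpus <= current.executor_cores:
        return ResourceProfile(current.executor_cores,
                               current.executor_memory_gb,
                               doubled_cpus)

    # Step 2: the task already owns the whole executor, so request a
    # physically larger executor instead.
    doubled_memory = current.executor_memory_gb * 2
    if doubled_memory <= max_executor_memory_gb:
        return ResourceProfile(current.executor_cores,
                               doubled_memory,
                               current.task_cpus)

    return None


if __name__ == "__main__":
    profile: Optional[ResourceProfile] = ResourceProfile(
        executor_cores=4, executor_memory_gb=16, task_cpus=1)
    while profile is not None:
        print(profile, "memory per task (GB):", profile.memory_per_task_gb)
        profile = next_profile(profile, max_executor_memory_gb=32)
```

Starting from a 4-core, 16 GB executor with one CPU per task, the sketch escalates through two CPU increases before requesting a 32 GB executor, then stops at the cap. Trying the CPU increase first is the cheaper step, since it reuses executors of the existing size.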