
Zero-Downtime PyTorch Upgrade in Production: Approaches, Pitfalls and Lessons

Pinterest upgraded its machine learning stack from PyTorch 2.1 to 2.6 to use new features and improve performance. The main obstacles were outdated dependencies, breaking API changes, and TorchScript compatibility. To meet PyTorch 2.6's requirements, the team updated the Ubuntu DLAMI (Deep Learning AMI) and CUDA versions. They resolved TorchScript initialization issues by disabling JIT profiling and the TorchScript fuser, and they mitigated breaking API changes by introducing a compile-time macro to bridge the two versions. To minimize downtime, limit production impact, and keep the transition smooth and reliable, the upgrade shipped through a time-windowed, multi-stage rollout. After the upgrade, the team fixed a loss of DCGM metrics by resolving a resource conflict, and also resolved intermittent model deployment failures.
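The note does not describe how the time-windowed, multi-stage rollout was implemented. As a minimal sketch of the general idea only, the Python below gates each host on a schedule of stages, where every stage opens at a fixed time and raises the share of hosts running the new version. The `Stage` type, the stage names and fractions, and the hash-based host bucketing are all hypothetical choices for illustration, not Pinterest's actual mechanism.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
import hashlib


@dataclass(frozen=True)
class Stage:
    """One rollout stage (hypothetical structure, for illustration only)."""
    name: str
    start: datetime   # when this stage's time window opens
    fraction: float   # share of hosts on the new version once the stage is active


def active_fraction(stages, now):
    """Return the fraction of the latest stage whose window has opened."""
    fraction = 0.0
    for stage in sorted(stages, key=lambda s: s.start):
        if now >= stage.start:
            fraction = stage.fraction
    return fraction


def host_bucket(host):
    """Map a host name to a stable value in [0, 1)."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def uses_new_version(host, stages, now):
    """A host moves to the new version once its bucket falls under the active fraction."""
    return host_bucket(host) < active_fraction(stages, now)


# Example schedule (assumed values): 1% canary, 25% a day later, 100% after two days.
t0 = datetime(2025, 1, 1, 10, 0)
stages = [
    Stage("canary", t0, 0.01),
    Stage("partial", t0 + timedelta(hours=24), 0.25),
    Stage("full", t0 + timedelta(hours=48), 1.0),
]
```

Because each host's bucket is fixed and the fractions only grow, a host that switches in an early stage stays on the new version in later stages. Pausing a rollout in this sketch amounts to not adding the next stage, and rolling back amounts to replacing the schedule with one whose fraction is 0.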