Ray Infrastructure at Pinterest
In 2023, Pinterest began integrating Ray into its infrastructure and ran into several challenges: limited access to the Kubernetes (K8s) API, ephemeral logs and metrics, and authentication requirements. To address these, Pinterest developed a Ray Cluster Controller and an API Gateway that together manage Ray Cluster provisioning and handle authentication. It also built a dedicated user interface backed by persistent logs and metrics, and integrated with its in-house time series database (Goku) for metrics visualization. Pinterest provides several development interfaces for Ray applications, including Jupyter, Dev server, and Spinner workflow, along with unit and integration testing frameworks. Cluster metrics are exported to a big data format for offline analysis. Pinterest's Ray infrastructure incorporates best practices from Ray while addressing company-specific requirements such as security, traffic settings, and service integrations. The platform gives Pinterest centralized control over Ray Cluster management and streamlines cluster provisioning and application development for users.
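
The division of labor between an API gateway and a cluster controller can be illustrated with a small sketch. This is not Pinterest's implementation: every name below (`ClusterRequest`, `ClusterController`, `ApiGateway`, the token table, the `num_workers` field) is hypothetical, and the controller only records state in memory where a real one would talk to the Kubernetes API. The sketch assumes a simple per-user token check; the actual authentication scheme is not described here.

```python
import hmac
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical provisioning request; field names are illustrative."""
    name: str
    owner: str
    num_workers: int


@dataclass
class ClusterController:
    """Stand-in for a controller that would create Ray clusters on Kubernetes."""
    clusters: dict = field(default_factory=dict)

    def provision(self, request: ClusterRequest) -> str:
        if request.name in self.clusters:
            raise ValueError(f"cluster {request.name!r} already exists")
        # Assumption: a real controller would call the Kubernetes API here.
        # This sketch only records the desired state in memory.
        self.clusters[request.name] = request
        return request.name


class ApiGateway:
    """Authenticates callers so they never need direct Kubernetes API access."""

    def __init__(self, controller: ClusterController, tokens: dict):
        self._controller = controller
        self._tokens = tokens  # hypothetical user -> token table

    def _authenticate(self, user: str, token: str) -> bool:
        expected = self._tokens.get(user)
        if expected is None:
            return False
        # Constant-time comparison avoids leaking token contents via timing.
        return hmac.compare_digest(expected.encode(), token.encode())

    def create_cluster(self, user: str, token: str, name: str, num_workers: int) -> str:
        if not self._authenticate(user, token):
            raise PermissionError("invalid credentials")
        if num_workers < 1:
            raise ValueError("num_workers must be at least 1")
        request = ClusterRequest(name=name, owner=user, num_workers=num_workers)
        return self._controller.provision(request)
```

In this sketch a user would call `ApiGateway.create_cluster(...)` with their credentials, and only the controller holds the privileges needed to create cluster resources. Keeping validation and authentication in the gateway is one way to centralize control while users have limited K8s API access.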