RSS Cloud Blog

Evolving Ray and Kubernetes together for the future of distributed AI and ML

Google Cloud and Anyscale are deepening the integration of Ray, an open-source compute engine for AI, with Kubernetes, and with Google Kubernetes Engine (GKE) in particular.

Label selectors have been introduced to Ray, mirroring Kubernetes functionality, to improve scheduling flexibility for distributed tasks and actors. Developers can assign labels to nodes and specify resource requirements, such as accelerator types, for task execution. Combining Ray and Kubernetes label selectors on GKE offers granular control over both application deployment and the underlying infrastructure.

Accelerator support is also advancing: next-generation AI accelerators such as the NVIDIA GB200 NVL72 can be used with Ray on GKE via Dynamic Resource Allocation. In addition, Ray is gaining more native support for TPUs, including a JAXTrainer API for streamlined TPU training.

Writable cgroups are now available on GKE for Ray clusters, letting Ray dynamically reserve resources for critical system tasks within containers. This improves the reliability of Ray clusters under intense workloads without compromising security.

Finally, in-place pod resizing, introduced in Kubernetes v1.33, marks the beginning of vertical autoscaling for Ray on Kubernetes. It can boost workload efficiency by allowing Ray workers to scale their resources faster and more flexibly. Together, Ray and Kubernetes are evolving into a powerful distributed operating system for AI/ML workloads.
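Conceptually, a label selector picks only those nodes whose labels satisfy the requested key/value constraints, in the same equality-matching style Kubernetes uses. A minimal plain-Python sketch of that matching logic (illustrative only, not Ray's actual implementation; the node names and label keys are hypothetical):

```python
def matches(node_labels: dict, label_selector: dict) -> bool:
    """True if the node carries every key/value pair the selector
    requires, mirroring Kubernetes-style equality-based matching."""
    return all(node_labels.get(k) == v for k, v in label_selector.items())

# Hypothetical cluster: each node advertises a set of labels.
nodes = {
    "worker-a": {"accelerator-type": "nvidia-h100", "zone": "us-central1-a"},
    "worker-b": {"accelerator-type": "tpu-v5e", "zone": "us-central1-b"},
}

# A task asking for a specific accelerator type is only eligible
# to run on nodes whose labels match the selector.
selector = {"accelerator-type": "nvidia-h100"}
eligible = [name for name, labels in nodes.items() if matches(labels, selector)]
# eligible == ["worker-a"]
```

In Ray, the analogous selector is attached to a task or actor's resource request, and the scheduler restricts placement to matching nodes; on GKE, Kubernetes label selectors apply the same idea one layer down, constraining which nodes the pods themselves land on.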
AI and ML News on Bluesky @ai-news.at.thenote.app
cloud.google.com