
From LLMs to image generation: Accelerate inference workloads with AI Hypercomputer

Google Cloud's AI Hypercomputer pairs integrated software frameworks with hardware accelerators to scale generative AI inference, and it has received a series of significant updates. The first is Ironwood, a new Tensor Processing Unit (TPU) designed specifically for inference.

JetStream, Google's open-source, throughput- and memory-optimized inference engine, delivers standout price-performance with low-latency, high-throughput serving and community support. It gains new performance optimizations, including Pathways for ultra-low-latency, multi-host disaggregated serving. Google Cloud is also offering more choice when serving LLMs on TPU by bringing vLLM support to TPU, so teams can keep vLLM's familiar interface while running on TPU hardware.

MaxDiffusion, a reference implementation of latent diffusion models, delivers standout performance on TPUs for compute-heavy image generation workloads and now supports Flux, one of the largest text-to-image generation models.

Finally, the latest MLPerf Inference v5.0 results demonstrate the power and versatility of Google Cloud's A3 Ultra and A4 VMs for inference, and customers such as Osmos and JetBrains are using TPUs and GPU instances to maximize cost-efficiency for inference at scale.
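The post itself includes no code, but to make the vLLM-on-TPU option concrete, here is a minimal offline-inference sketch using vLLM's public Python API. The model name is an illustrative placeholder, and running this on TPU assumes a vLLM build with TPU support installed on a Cloud TPU VM; installation and backend selection are not shown.

```python
from vllm import LLM, SamplingParams

# Load an open-weights model (placeholder ID). On a Cloud TPU VM with a
# TPU-enabled vLLM install, the TPU backend is picked up at load time.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

prompts = [
    "Explain what disaggregated serving means for LLM inference.",
]

# generate() batches prompts for high-throughput offline inference and
# returns one RequestOutput per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```

The same engine also backs vLLM's OpenAI-compatible server (`vllm serve <model>`), which is typically how it would be deployed behind an endpoint rather than called offline as above.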
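MaxDiffusion itself is JAX-based and config-driven rather than driven through a Python API like the one below. As a rough illustration of the Flux text-to-image workload it now supports, here is a sketch using Hugging Face's diffusers FluxPipeline, a different library used only to show the shape of the workload; the model ID and parameters follow the published FLUX.1-schnell example.

```python
import torch
from diffusers import FluxPipeline

# Load the distilled "schnell" Flux variant; bfloat16 keeps memory in check.
# The weights require accepting the model license on the Hugging Face Hub.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # optional: trade speed for lower VRAM

image = pipe(
    "a photorealistic hummingbird in flight",
    num_inference_steps=4,  # schnell is distilled for few-step sampling
    guidance_scale=0.0,     # schnell is trained without classifier-free guidance
).images[0]
image.save("flux_sample.png")
```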
Source: cloud.google.com