AI & ML News

Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

Google Cloud's Vertex AI Model Garden streamlines machine learning workflows with over 150 models, spanning first-party, open-source, and third-party offerings. Last year, Google introduced the vLLM serving stack on GPUs; now it has unveiled Hex-LLM, a serving framework optimized for Cloud TPUs using XLA. Hex-LLM makes serving large language models (LLMs) more efficient and cost-effective by incorporating state-of-the-art techniques such as continuous batching and paged attention, and it supports a range of popular dense and sparse LLMs with high throughput and low latency.

Key optimizations include a token-based continuous batching algorithm, a PagedAttention kernel rewritten for TPUs, and flexible data and tensor parallelism strategies. Benchmarks on the ShareGPT dataset showed strong results, with models such as Gemma 7B and Llama 2 70B delivering competitive performance on TPU v5e chips. Users can deploy Hex-LLM through Vertex AI Model Garden's playground, one-click deployment, or Colab Enterprise notebooks, and can tune deployments to handle varying traffic, making Hex-LLM a powerful option for efficient LLM serving on Google's TPU hardware.
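To make the paged-attention idea concrete, here is a minimal, hypothetical Python sketch of the KV-cache bookkeeping that paged attention enables: each sequence's cache is built from fixed-size blocks allocated on demand out of a shared pool, so memory grows with the tokens actually generated rather than a padded maximum length. The class and block size below are illustrative assumptions, not Hex-LLM's TPU kernel or its actual data structures.

```python
# Toy model of paged KV-cache allocation (illustrative only, not Hex-LLM's
# implementation): cache blocks hold BLOCK_SIZE tokens each and are handed
# out from a shared pool as sequences grow.

BLOCK_SIZE = 16  # tokens per cache block (assumed value for illustration)


class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # shared pool of block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens cached

    def append_token(self, seq_id: int) -> None:
        """Reserve cache space for one newly generated token of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("cache pool exhausted; a sequence must be preempted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4)
for _ in range(20):  # 20 tokens need ceil(20/16) = 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))  # -> 2
```

Because blocks are released back to the pool the moment a sequence finishes, a continuous-batching scheduler can immediately admit waiting requests into the batch, which is what drives the throughput gains the article describes.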
cloud.google.com