Google Cloud's Vertex AI Model Garden streamlines machine learning workflows with over 150 models spanning first-party, open-source, and third-party offerings. Last year Google introduced the vLLM serving stack on GPUs; it has now unveiled Hex-LLM, a serving framework optimized for Cloud TPUs via XLA. Hex-LLM makes serving large language models (LLMs) more efficient and cost-effective by incorporating state-of-the-art techniques such as continuous batching and paged attention, and it supports a range of popular dense and sparse LLMs with high throughput and low latency.

Key optimizations include a token-based continuous batching algorithm, a PagedAttention kernel rewritten for XLA/TPU, and flexible data- and tensor-parallelism strategies. Benchmarks on the ShareGPT dataset showed strong results, with models such as Gemma 7B and Llama 2 70B delivering competitive throughput and latency on TPU v5e chips.

Users can deploy Hex-LLM through Vertex AI Model Garden's playground, one-click deployment, or Colab Enterprise notebooks. This flexibility allows deployments to be customized for varying traffic needs, making Hex-LLM a practical option for efficient LLM serving on Google's TPU hardware.
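To make the token-based continuous batching idea concrete: instead of admitting a fixed number of requests per step, the scheduler fills a per-step token budget, mixing one-token decode work for running requests with full-prompt prefill work for waiting ones. The sketch below is illustrative only; the class, function, and budget value are assumptions and not Hex-LLM's actual scheduler.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_tokens: int       # tokens still to prefill
    decode_tokens_left: int  # tokens still to generate

def schedule_step(waiting: deque, running: list, token_budget: int) -> int:
    """Select work for one model step under a per-step token budget.

    Each running (decoding) request costs 1 token; a waiting request
    costs its full prompt length to prefill. Returns tokens used this
    step. (Illustrative sketch, not Hex-LLM's real implementation.)
    """
    used = 0
    # Decode phase: every running request generates one token.
    for req in list(running):
        if used + 1 > token_budget:
            break
        used += 1
        req.decode_tokens_left -= 1
        if req.decode_tokens_left == 0:
            running.remove(req)  # finished; slot freed for new work
    # Prefill phase: admit waiting requests while budget remains.
    while waiting and used + waiting[0].prompt_tokens <= token_budget:
        req = waiting.popleft()
        used += req.prompt_tokens
        req.prompt_tokens = 0
        running.append(req)
    return used
```

For example, with a budget of 16 tokens, a request with an 8-token prompt is admitted immediately, while a 100-token prompt waits until enough budget is free in a later step; meanwhile already-running requests keep decoding one token per step.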
cloud.google.com
