EC2 G7e: Architecture Decision for Generative Video Inference

This document records an architectural decision on GPU instance selection for generative video inference in financial environments. Generative video inference is harder to serve than image inference: the model must hold temporal state across frames, throughput is bound by GPU memory bandwidth, and latency requirements are strict. Models need a large amount of VRAM, and memory consumption grows directly with clip duration and resolution. The new EC2 G7e instances, built on NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs with 96 GB of memory per GPU (the 48 GB NVIDIA L40S belongs to the G6e family), address this need by keeping the model and its working state in GPU memory, which removes the need for CPU offloading.

The architectural forces behind the decision are cost per token versus hourly instance cost, regional availability and data residency regulations, tenant isolation requirements, and cold start times. Among the G5, G6, and G7e options evaluated, G7e is the preferred choice for production workloads with latency SLOs under 90 seconds for 720p to 1080p video. Amazon Bedrock is recommended as a managed fallback for traffic spikes and for regions where G7e is not available.

The proposed architecture runs on EKS, with Karpenter orchestrating G7e nodes and a warm pool strategy to mitigate cold starts. Security and compliance are addressed through tenant-specific encryption, IAM Roles for Service Accounts (IRSA) for pod-level IAM permissions, prompt injection protection, and comprehensive auditability. The larger GPU memory and higher memory bandwidth of G7e, relative to G5 and G6, are what shorten inference times enough to meet the latency requirement.
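Two of the forces above, hourly cost versus cost per unit of output and the Bedrock fallback for spikes and regions without G7e, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the names InstanceOption, cost_per_clip, and choose_backend are hypothetical, the prices and timings are placeholders rather than AWS figures, cost is counted per clip rather than per token for simplicity, and the routing rule (fall back when the region has no G7e capacity or the warm pool is saturated) is one plausible reading of the note, not its actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceOption:
    """Hypothetical description of one candidate instance type."""
    name: str
    hourly_usd: float        # placeholder price, not an AWS list price
    seconds_per_clip: float  # placeholder end-to-end inference time per clip


def cost_per_clip(option: InstanceOption, utilization: float = 1.0) -> float:
    """Spread the hourly price over the clips actually served in that hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    clips_per_hour = 3600.0 / option.seconds_per_clip * utilization
    return option.hourly_usd / clips_per_hour


def choose_backend(region: str, g7e_regions: set[str],
                   queue_depth: int, warm_capacity: int) -> str:
    """Prefer self-hosted G7e; fall back to the managed service otherwise."""
    if region not in g7e_regions:
        return "bedrock"  # no G7e capacity in this region
    if queue_depth >= warm_capacity:
        return "bedrock"  # spike: warm pool is already saturated
    return "g7e"


if __name__ == "__main__":
    slower = InstanceOption("older-gen", hourly_usd=1.0, seconds_per_clip=180.0)
    faster = InstanceOption("g7e", hourly_usd=2.0, seconds_per_clip=60.0)
    for opt in (slower, faster):
        print(f"{opt.name}: ${cost_per_clip(opt):.4f} per clip")
    print(choose_backend("us-east-1", {"us-east-1"}, queue_depth=1, warm_capacity=4))
```

With these placeholder numbers the instance that costs twice as much per hour is still cheaper per clip, because it finishes each clip three times faster. In a real deployment the fallback would also have to respect data residency, so the set of regions allowed for each tenant would constrain both backends.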