Handling Network Throttling with AWS EC2 at Pinterest
Pinterest, a visual search engine, runs on AWS and uses Amazon EC2 instances for its compute fleet. Managing that fleet proved difficult, particularly for online storage systems, because the company lacked clear insight into EC2 network performance and its effect on application reliability and performance. To address this, Pinterest built network performance monitoring for its EC2 fleet and adopted techniques to manage network bursts, so that critical online serving workloads get dependable network performance.

One affected workload was user sequence serving, which drove significant user engagement wins but also suffered from higher serving latency and application timeouts. During an EC2 instance migration, Pinterest also saw significant performance degradation across many clusters, again leading to application timeouts. The company discovered that the instances were being throttled at the network level because traffic microbursts exceeded their network allowance.

To make EC2 network throttling more transparent, Pinterest upgraded its instances so that the raw Elastic Network Adapter (ENA) counters could be read on each instance with tools such as ethtool. It then modified its internal metrics collection agent to scrape these counters and ingest them into its metrics storage. Rolling the ENA metrics out to the entire EC2 fleet gave Pinterest visibility into AWS traffic shaping that it previously lacked, and the company used that visibility to implement optimizations that mitigate throttling. It also explored techniques for handling network bursts, including fine-grained S3 rate limiting, data backup tuning, and network compression.
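
The counter-scraping step can be illustrated with a minimal Python sketch. It is not Pinterest's agent: the function names (`parse_ena_counters`, `counter_deltas`, `read_ena_counters`) and the default interface name `eth0` are hypothetical, and the counter list is the set of allowance-exceeded statistics that AWS documents for the ENA driver, which may or may not be exactly what Pinterest collects. The sketch assumes `ethtool -S` prints one `name: value` pair per line and that the counters only increase between driver resets.

```python
import subprocess

# Allowance counters documented by AWS for the ENA driver. Whether Pinterest's
# agent scrapes exactly this set is an assumption made for illustration.
ENA_ALLOWANCE_COUNTERS = (
    "bw_in_allowance_exceeded",
    "bw_out_allowance_exceeded",
    "pps_allowance_exceeded",
    "conntrack_allowance_exceeded",
    "linklocal_allowance_exceeded",
)


def parse_ena_counters(ethtool_output):
    """Extract the allowance counters from the text of `ethtool -S <iface>`."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a "name: value" line
        name = name.strip()
        if name not in ENA_ALLOWANCE_COUNTERS:
            continue  # header line or an unrelated statistic
        try:
            counters[name] = int(value.strip())
        except ValueError:
            continue  # skip values that are not plain integers
    return counters


def counter_deltas(previous, current):
    """Return the increase of each counter since the previous sample.

    A counter that went backwards is treated as reset, so its current value
    is reported. A counter with no previous sample reports zero.
    """
    deltas = {}
    for name, value in current.items():
        before = previous.get(name, value)
        deltas[name] = value - before if value >= before else value
    return deltas


def read_ena_counters(interface="eth0"):
    """Run ethtool on a (hypothetical) interface name and parse its output."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ena_counters(result.stdout)
```

A collection agent could call `read_ena_counters()` on a fixed interval and emit the result of `counter_deltas` as a metric. A non-zero delta for a counter such as `bw_in_allowance_exceeded` would suggest the instance was throttled during that interval, even when averaged bandwidth graphs look well below the instance limit.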