In 2018, Etsy migrated its Kafka brokers to Google Kubernetes Engine on Google Cloud Platform. The cluster initially ran in a single zone; the team later redesigned the architecture for zonal resilience, spreading brokers across multiple zones and distributing partition replicas evenly among them.
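To make "even distribution of partition replicas" concrete, here is a minimal sketch of a zone-aware round-robin assignment. It is not Etsy's tooling and not Kafka's own algorithm (Kafka's built-in rack-aware assignment, driven by the `broker.rack` setting, is more involved). The zone names, broker IDs, and the `assign_replicas` helper are all hypothetical.

```python
from collections import Counter
from itertools import cycle


def assign_replicas(brokers_by_zone, num_partitions, replication_factor):
    """Place each partition's replicas in distinct zones, round-robin."""
    zones = sorted(brokers_by_zone)
    if replication_factor > len(zones):
        raise ValueError("this sketch needs at least one zone per replica")
    # One rotating cursor per zone so brokers inside a zone share the load.
    cursors = {zone: cycle(brokers_by_zone[zone]) for zone in zones}
    assignment = {}
    for partition in range(num_partitions):
        # Rotate the starting zone so first replicas also spread across zones.
        chosen = [zones[(partition + i) % len(zones)]
                  for i in range(replication_factor)]
        assignment[partition] = [next(cursors[zone]) for zone in chosen]
    return assignment


# Hypothetical layout: three zones with two brokers each.
brokers_by_zone = {"zone-a": [0, 1], "zone-b": [2, 3], "zone-c": [4, 5]}
assignment = assign_replicas(brokers_by_zone, num_partitions=6,
                             replication_factor=3)
replicas_per_broker = Counter(
    broker for replicas in assignment.values() for broker in replicas
)
```

With this layout every partition has one replica per zone and every broker hosts the same number of replicas, so losing any single zone still leaves two replicas of each partition.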
To achieve a zero-downtime migration, the team moved brokers by first snapshotting their disks and then recreating them in the correct zones. Partition relocation was handled manually, using scripts and tools chosen to minimize data movement and impact on the cluster.
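One way to keep data movement small during partition relocation is to replace only the replicas that duplicate a zone and leave the rest where they are. The sketch below illustrates that idea under stated assumptions; it is not the script Etsy used. The topic name `orders`, the broker-to-zone map, and the `plan_reassignment` helper are hypothetical. The output follows the JSON layout accepted by Kafka's `kafka-reassign-partitions` tool.

```python
import json
from collections import Counter


def plan_reassignment(topic, assignment, zone_of):
    """Swap out only replicas whose zone is already covered by an earlier one."""
    load = Counter(b for replicas in assignment.values() for b in replicas)
    partitions = []
    for partition, replicas in sorted(assignment.items()):
        new_replicas, used_zones, duplicates = list(replicas), set(), []
        for index, broker in enumerate(replicas):
            if zone_of[broker] in used_zones:
                duplicates.append(index)  # this replica adds no zone coverage
            else:
                used_zones.add(zone_of[broker])
        for index in duplicates:
            candidates = [b for b in zone_of if zone_of[b] not in used_zones]
            if not candidates:
                break  # more replicas than zones: leave the rest in place
            # Prefer the least-loaded broker; break ties by broker ID.
            target = min(candidates, key=lambda b: (load[b], b))
            load[new_replicas[index]] -= 1
            load[target] += 1
            used_zones.add(zone_of[target])
            new_replicas[index] = target
        if new_replicas != list(replicas):
            partitions.append({"topic": topic, "partition": partition,
                               "replicas": new_replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical state after brokers were recreated in their target zones.
zone_of = {0: "zone-a", 1: "zone-a", 2: "zone-b", 3: "zone-c"}
current = {0: [0, 1, 2], 1: [0, 2, 3]}
plan = plan_reassignment("orders", current, zone_of)
plan_json = json.dumps(plan, indent=2)
```

Here only partition 0 appears in the plan, with a single replica moved from broker 1 to broker 3. Partition 1 already spans three zones and is left untouched.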
Post-migration testing in production demonstrated the effectiveness of the multizone design: a zone outage caused only minimal disruption. Inter-zone network costs increased, as expected, but the benefits of automated zone resilience outweigh them.
The team is optimizing costs by leveraging Kafka's follower fetching feature, which lets consumers read from a replica in their own zone instead of always reading from the partition leader, and is exploring additional approaches to reduce cross-zone traffic.
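As a rough illustration of follower fetching, the sketch below lists the Kafka settings involved and mimics the selection rule in plain Python: prefer an in-sync replica in the consumer's zone, otherwise fall back to the leader. The setting names (`broker.rack`, `replica.selector.class`, `client.rack`) are Kafka's own. The zone values, broker IDs, and the `select_fetch_replica` helper are hypothetical, and the real selector runs inside the broker rather than in client code.

```python
# Settings involved in follower fetching (the zone value is a placeholder).
BROKER_SETTINGS = {
    "broker.rack": "zone-b",
    "replica.selector.class":
        "org.apache.kafka.common.replica.RackAwareReplicaSelector",
}
CONSUMER_SETTINGS = {"client.rack": "zone-b"}


def select_fetch_replica(client_rack, leader, in_sync_replicas, rack_of):
    """Pick a same-zone in-sync replica if one exists, else the leader."""
    if client_rack is not None:
        for broker in in_sync_replicas:
            if rack_of.get(broker) == client_rack:
                return broker  # same zone: no cross-zone fetch traffic
    return leader


# Hypothetical partition with one replica per zone, led by broker 0.
rack_of = {0: "zone-a", 1: "zone-b", 2: "zone-c"}
same_zone = select_fetch_replica(CONSUMER_SETTINGS["client.rack"], leader=0,
                                 in_sync_replicas=[0, 1, 2], rack_of=rack_of)
fallback = select_fetch_replica("zone-d", leader=0,
                                in_sync_replicas=[0, 1, 2], rack_of=rack_of)
```

A consumer in `zone-b` is served by broker 1 in its own zone, while a consumer whose zone holds no replica still reads from the leader. Producer writes and replication between brokers continue to cross zones, so this reduces cross-zone traffic without eliminating it.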
The migration involved complex steps, including disk and Pod movement, partition relocation, and configuration adjustments. The team's careful planning and execution ensured zero downtime and data integrity throughout the process.
Etsy's experience highlights the importance of designing for resilience in critical services. By embracing zonal redundancy, the team mitigated the risks associated with single-zone failures and improved the stability and availability of their Kafka cluster.
The multizone architecture enables Etsy to handle increased production traffic and critical user-facing features, such as search indexing, with confidence.
The company's ongoing efforts to optimize costs demonstrate a commitment to balancing resilience with financial considerations.
The case study provides valuable insights into the challenges and strategies involved in migrating and operating a highly available Kafka cluster in a multi-zone cloud environment.
etsy.com
