Initially, platform upgrades for Kafka brokers were tedious and time-consuming, requiring hours of manual monitoring and waiting. To improve this, a multizone architecture was implemented, allowing multiple brokers to be updated simultaneously without affecting availability. However, Kubernetes' native rolling update strategy was not suitable due to the zonal distribution of replicas.
Custom logic was developed to control updates, allowing multiple brokers in a zone to be restarted concurrently. This was implemented as a Kubernetes batch job to ensure reliability and prevent accidental deployment issues.
Testing in production showed that with a parallelism of three, upgrades could be completed in about two hours. While restarting all brokers in a zone simultaneously was technically possible, it was avoided to prevent increased load on remaining brokers.
The multizone architecture and custom update logic significantly reduced upgrade time, from seven hours to around two hours. This improvement not only saved time but also reduced the toil and stress associated with upgrades.
The new process ensured quick and efficient upgrades, with minimal impact on the Kafka cluster. The project's success was measured not only by reduced duration but also by the ease and peace of mind it provided during upgrades.
etsy.com
etsy.com
