Automated Schema Evolution in Pinterest’s Next-Generation DB Ingestion Framework
Pinterest has developed a robust, automated schema evolution framework for their Kafka-based CDC ingestion platform. Schema changes are a critical, cross-system contract, and unchecked evolution can lead to pipeline failures and data inconsistencies. Their solution focuses on making schema evolution safe, repeatable, and scalable by treating it as a multi-stage convergence process. The architecture involves CDC sources, Kafka, Flink for transformation, and Spark for upserts into Iceberg tables.A core component is a reliable onboarding model that uses schema definition files with stable numeric identifiers as the source of truth. Updates propagate automatically across Kafka, Flink, Spark, and Iceberg through a PR-based rollout with versioning and auditing. The system supports primarily additive schema changes to maintain backward compatibility and minimize complexity. Type changes are strictly limited to those preserving semantic meaning, like numeric precision widening.Schema evolution is managed through a three-phase convergence model to maintain pipeline availability. Phase one updates Iceberg schemas, phase two deploys updated Flink and Spark code, and phase three ensures data convergence. This phased approach decouples schema propagation from data correctness, allowing temporary divergence within a defined SLA. Pinterest employs an SLA-based model for schema evolution, prioritizing predictability and operational safety.Deployment strategies are carefully managed, especially for Flink, to prevent data loss. Unsupported or ambiguous cases, such as default values or primary key changes, have specific manual recovery paths. Ambiguous CREATE TABLE diffs are resolved by comparing against the database's actual DDL history rather than inferring intent from textual changes. Concurrent schema changes are handled sequentially to prevent race conditions, ensuring serialized convergence. Column transformations are managed by annotating schemas with required transformations, which are then injected into the ingestion pipeline. Error handling and recovery mechanisms, particularly for Spark failures, ensure that processing resumes from the last successful watermark.
CREATE TABLEdiffs are resolved by comparing against the database's actual DDL history rather than inferring intent from textual changes. Concurrent schema changes are handled sequentially to prevent race conditions, ensuring serialized convergence. Column transformations are managed by annotating schemas with required transformations, which are then injected into the ingestion pipeline. Error handling and recovery mechanisms, particularly for Spark failures, ensure that processing resumes from the last successful watermark.