Pinterest created a new database ingestion framework to replace its slow, batch-oriented legacy system. The new framework uses Change Data Capture (CDC), Kafka, Flink, and Spark to ingest data in near real-time. This design offers lower latency and better efficiency than older methods. The system ingests changes from databases like MySQL and TiDB into CDC tables. Flink streams process these CDC events and store them in Iceberg tables. Spark jobs then periodically merge changes from CDC tables into base tables using "Merge Into" statements. Key optimizations include partitioning base tables and using bucket joins for efficiency. These techniques reduce compute costs and improve the speed of the upsert operations. The team standardized on the Merge-on-Read (MOR) approach for its advantages. The framework supports row-level deletions and provides native data compliance. Future work will focus on automated schema evolution within the framework.
medium.com
medium.com
