Pinterest Engineering | Medium Note

Pinterest Engineering, showcased on Medium, provides a behind-the-scenes look at the technological innovations driving the popular visual discovery platform. Through in-depth articles, engineers share insights into their work on scalability, machine learning, data infrastructure, and more. The publication highlights Pinterest's engineering culture, emphasizing collaboration, experimentation, and a passion for solving complex problems. Readers can explore topics like building recommendation systems, optimizing search functionality, and developing tools for data analysis. The content offers valuable perspectives for engineers and tech enthusiasts interested in the intricacies of a large-scale platform like Pinterest. Whether delving into the challenges of image recognition or the evolution of their infrastructure, Pinterest Engineering on Medium provides a fascinating glimpse into the technical side of a beloved online destination.

Thread Of Notes

CdXz5zHNQW_MsiwuAmomZ.png
Pinterest has developed a robust, automated schema evolution framework for their Kafka-based CDC ingestion platform. Schema changes are a critical, cross-system contract, and unchecked evolution can lead to pipeline failures and data inconsistencies. Their solution focuses on making schema evolution safe, repeatable, and scalable by treating it as a multi-stage convergence process. The architecture involves CDC sources, Kafka, Flink for transformation, and Spark for upserts into Iceberg tables.
A core component is a reliable onboarding model that uses schema definition files with stable numeric identifiers as the source of truth. Updates propagate automatically across Kafka, Flink, Spark, and Iceberg through a PR-based rollout with versioning and auditing. The system supports primarily additive schema changes to maintain backward compatibility and minimize complexity. Type changes are strictly limited to those preserving semantic meaning, like numeric precision widening.
Schema evolution is managed through a three-phase convergence model to maintain pipeline availability. Phase one updates Iceberg schemas, phase two deploys updated Flink and Spark code, and phase three ensures data convergence. This phased approach decouples schema propagation from data correctness, allowing temporary divergence within a defined SLA. Pinterest employs an SLA-based model for schema evolution, prioritizing predictability and operational safety.
Deployment strategies are carefully managed, especially for Flink, to prevent data loss. Unsupported or ambiguous cases, such as default values or primary key changes, have specific manual recovery paths. Ambiguous CREATE TABLE diffs are resolved by comparing against the database's actual DDL history rather than inferring intent from textual changes. Concurrent schema changes are handled sequentially to prevent race conditions, ensuring serialized convergence.
Column transformations are managed by annotating schemas with required transformations, which are then injected into the ingestion pipeline. Error handling and recovery mechanisms, particularly for Spark failures, ensure that processing resumes from the last successful watermark.
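The "additive changes plus numeric widening" rule described above can be illustrated with a small compatibility check. This is only a sketch under stated assumptions: the `Field` class, the `check_evolution` function, and the specific widening pairs are hypothetical and are not taken from Pinterest's framework. Fields are matched by their stable numeric identifier rather than by name.

```python
from dataclasses import dataclass

# Hypothetical widening rules; the summary only says "numeric precision widening".
ALLOWED_WIDENINGS = {("int", "long"), ("float", "double")}


@dataclass(frozen=True)
class Field:
    field_id: int  # stable numeric identifier used to match fields across versions
    name: str
    type: str


def check_evolution(old, new):
    """Return a list of violations; an empty list means the change is allowed."""
    new_by_id = {f.field_id: f for f in new}
    violations = []
    for f in old:
        candidate = new_by_id.get(f.field_id)
        if candidate is None:
            # Removing a field is not an additive change.
            violations.append(f"field {f.field_id} ({f.name}) was removed")
        elif candidate.type != f.type and (f.type, candidate.type) not in ALLOWED_WIDENINGS:
            violations.append(
                f"field {f.field_id}: {f.type} -> {candidate.type} is not a widening"
            )
    # Fields present only in `new` are additive and therefore accepted.
    return violations
```

A check of this kind could run in the PR-based rollout before any change reaches Kafka, Flink, Spark, or Iceberg, so that unsupported changes are routed to a manual path instead.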
CdXz5zHNQW_srAU1TSiiq.png
CdXz5zHNQW_AFL9DXaCyE.png
CdXz5zHNQW_rmxcXIRNOK.png
Pinterest's online ML serving system uses a root-leaf architecture where client services request scores for Pins. The root component handles feature retrieval and preprocessing, while leaves perform model inference, often on GPUs. This design simplifies onboarding new models and optimizes resource utilization by separating CPU and GPU workloads. However, it led to a network bottleneck between the root and leaf partitions due to passing many features.
Initially, lz4 compression was implemented to reduce network usage, resulting in significant bandwidth savings but with a slight increase in CPU usage and latency. This was a good start, but the core issue of shipping unnecessary features persisted. The "Send What You Use" approach was then developed to address this by only sending features that a specific model requires.
The model signature, which defines a model's inputs and outputs, serves as the source of truth for feature requirements. As models are trained and exported, their signatures are saved alongside them. Leaves then load these signatures to build feature converters that process only the necessary features.
To synchronize feature requirements between the root and leaves, model signatures are published as lightweight artifacts. These signatures are aggregated into bundle-level mappings, which are then deployed to the root alongside existing configurations. This deployment follows the same staged delivery process as model rollouts, ensuring consistency and enabling graceful rollbacks.
This integration allows the Feature Trimmer to dynamically update feature allowlists on the root, ensuring that only essential features are transmitted. The system is designed to handle frequent model updates and gradual rollouts by using versioned lookups and fallback mechanisms. This ensures that the root's view of required features stays synchronized with the actual models deployed on the leaves.
By trimming unneeded features, Pinterest significantly reduced network traffic and improved infrastructure efficiency.
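The trimming idea can be sketched as a versioned allowlist lookup with a fallback. Everything named here (`allowlist_for`, `trim_features`, the `(model, version)` key shape) is a hypothetical illustration rather than the actual Feature Trimmer; in particular, falling back to sending every feature when a signature is missing is an assumption about what a safe fallback might look like.

```python
from typing import Optional


def allowlist_for(signatures, deployed) -> Optional[set]:
    """Union the input feature names of every deployed (model, version) pair.

    `signatures` maps (model, version) -> collection of input feature names.
    Returns None when any signature is unknown, signalling "do not trim".
    """
    names = set()
    for key in deployed:
        signature = signatures.get(key)
        if signature is None:
            return None  # assumed fallback: unknown version, send everything
        names |= set(signature)
    return names


def trim_features(features: dict, allowlist: Optional[set]) -> dict:
    """Keep only the features some deployed model actually reads."""
    if allowlist is None:
        return features
    return {name: value for name, value in features.items() if name in allowlist}
```

During a gradual rollout both the old and the new model version would appear in `deployed`, so the allowlist is the union of their inputs and neither version is starved of a feature.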
CdXz5zHNQW_Pr67hugpQp.png
Pinterest developed a dedicated candidate generation model for conversion ads to address challenges with offsite conversion data sparsity and noise. This model differs from previous engagement-based systems by focusing on lower-funnel conversions. The initial launch in 2023 yielded significant improvements in both conversion and engagement metrics, including a higher clickthrough rate. Further iterations in 2025 delivered even greater conversion value and enhanced advertiser return on ad spend. To combat data sparsity, the model is trained across all shopping surfaces using a multi-surface approach. It supplements primary conversion signals with onsite engagement data, re-weighting click data based on duration to mitigate noise. Harder negatives, such as ad impressions with no engagement, are used for more robust contrastive learning. The model incorporates user-side features capturing real-time intent and long-term preferences, alongside Pin-side features for semantic understanding and performance tracking. A two-tower architecture with DCN v2 and an MLP in parallel cross layers enhances feature interaction modeling and retrieval quality. The model evolved from a multi-head design to a unified multi-task architecture, allowing direct benefit from multi-task optimization during serving. An advertiser-level loss function was introduced to provide a more stable granularity for conversion signals, leading to substantial recall improvements. This new model successfully increased shopping conversion volume and improved advertiser performance while enhancing the user shopping experience.
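For readers unfamiliar with DCN v2, a single cross layer computes x_next = x0 * (W·x_l + b) + x_l, where the product with x0 is element-wise. The sketch below implements only that published formula in plain Python; it says nothing about how Pinterest wires the cross layers together with the parallel MLP, and the function name and list-based representation are illustrative assumptions.

```python
def cross_layer(x0, xl, weight, bias):
    """One DCN v2 cross layer: x0 * (weight @ xl + bias) + xl.

    x0 is the original input vector, xl the output of the previous layer,
    weight a square matrix (list of rows), and bias a vector.
    """
    projected = [
        sum(w * x for w, x in zip(row, xl)) + b
        for row, b in zip(weight, bias)
    ]
    # Element-wise product with x0 creates explicit feature crosses;
    # adding xl back is the residual connection.
    return [a * p + x for a, p, x in zip(x0, projected, xl)]
```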
CdXz5zHNQW_iZkUUBsGZ2.png
Pinterest uses content understanding to drive distribution and engagement, requiring insight into images and outbound links. The core problem is URL normalization, where identical product pages appear under varied URLs due to tracking parameters. This redundancy leads to wasted computational resources through repeated fetching and processing. Item canonicalization aims to unify identical items represented by different URLs, crucial for shopping catalogs. When item IDs are absent, advanced URL normalization is vital for deduplication.
The Minimal Important Query Param Set (MIQPS) algorithm automatically learns which URL parameters influence content identity. It distinguishes between neutral parameters, which don't affect page content, and non-neutral parameters, which do. While static rules work for well-known platforms, Pinterest's vast domain set requires a dynamic, data-driven approach.
The MIQPS algorithm operates in three steps. First, it collects a corpus of observed URLs per domain from Pinterest's ingestion pipeline. Second, URLs are grouped by their query parameter pattern, ensuring parameters are analyzed in their specific context. This prevents misclassifying a parameter based on a different URL type.
Finally, for each parameter within a pattern, the algorithm empirically tests its importance. It samples URLs with distinct parameter values and computes content IDs for both the original and modified (parameter-removed) URLs. If removing the parameter changes the content ID in a significant percentage of samples, it's classified as non-neutral and retained. Otherwise, it's deemed neutral and can be safely stripped for normalization. Each merchant domain receives its own MIQPS map, accounting for domain-specific parameter meanings.
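The three steps can be sketched for a single domain as follows. This is a simplified illustration, not the production algorithm: `learn_miqps`, `pattern_of`, and `drop_param` are hypothetical names, the `content_id` callable stands in for whatever fetch-and-fingerprint step is really used, the 0.5 threshold is an arbitrary assumption, and the sampling of URLs with distinct parameter values is skipped in favour of testing every URL in the group.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def pattern_of(url):
    """The query-parameter pattern of a URL: its sorted set of parameter names."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return tuple(sorted({name for name, _ in pairs}))


def drop_param(url, param):
    """Return the URL with every occurrence of `param` removed from the query."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def learn_miqps(urls, content_id, threshold=0.5):
    """Map each pattern to the parameters that must be kept (non-neutral)."""
    by_pattern = defaultdict(list)
    for url in urls:
        by_pattern[pattern_of(url)].append(url)

    important = {}
    for pattern, group in by_pattern.items():
        keep = set()
        for param in pattern:
            # Count how often removing the parameter changes the content identity.
            changed = sum(content_id(u) != content_id(drop_param(u, param)) for u in group)
            if changed / len(group) >= threshold:
                keep.add(param)
        important[pattern] = keep
    return important
```

Normalizing a URL then amounts to looking up its pattern in the domain's map and calling `drop_param` for every parameter that is not in the kept set.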
CdXz5zHNQW_WVip85jMBw.png
CdXz5zHNQW_RcLxSqw9JO.png
CdXz5zHNQW_acxjx5IRwX.png
Large online platforms face the challenge of organizing billions of items into navigable shopping collections. Historically, these collections relied on user search history and manual curation. However, multimodal large language models (LLMs) now enable generating collections directly from content, while still considering user search patterns. This paper introduces Pinlanding, a production pipeline for shopping collection generation. Pinlanding comprises four components: understanding user search intent, building a shopping collection vocabulary using LLMs, constructing feeds from attributes, and evaluating/evolving the system. User interaction data helps characterize shopping intents, revealing both high-volume searches and emerging long-tail conversational queries. A vision-language model generates initial product attributes, which are then curated into a compact vocabulary using statistical filtering, embedding-based clustering, and LLM-assisted review. A CLIP-style dual-encoder model is trained for scalable attribute assignment, efficiently mapping products to attributes. Ray is used for scalable batch inference in attribute assignment, and Spark constructs feeds by scoring product-topic relevance. The CLIP-based classifier shows superior performance on a fashion attribute prediction benchmark. Human evaluation demonstrates that Pinlanding significantly improves precision in collection quality compared to traditional methods. The system has led to a four-fold increase in unique shopping topics and a 35% improvement in search performance. Future work involves integrating social trends and developing an AI-agent layer to handle emergent composite concepts.
CdXz5zHNQW_plPICGLX7O.png
Pinterest Search developed a method to enhance search relevance evaluation using Large Language Models (LLMs). Traditional relevance measurement relied on costly human annotations, limiting the scale and sensitivity of A/B experiments. To address this, they fine-tuned open-source LLMs on human-labeled data to predict Pin relevance to queries. This LLM-based approach treats relevance prediction as a multiclass classification problem, utilizing features like Pin titles, descriptions, and image captions.
They adopted a stratified query sampling design, which reduces the Minimum Detectable Effect (MDE) by an order of magnitude. This new methodology enables the measurement of heterogeneous treatment effects and improves evaluation efficiency. The LLM labeling process significantly lowers costs and time, allowing for larger and more representative sample sizes.
After fine-tuning, the LLM-based relevance model generates relevance scores, which are then used to compute metrics like sDCG@K. Rigorous validation showed high alignment between LLM-generated labels and human annotations, with an exact match rate of 73.7% and strong rank-based correlations. This alignment holds across query popularity segments.
The LLM-based relevance assessment proved effective for non-English queries as well, maintaining strong correlations and low bias. By transitioning to LLM-based relevance assessment, Pinterest Search has been able to scale up their evaluation query sets and improve the quality of relevance metrics for online experiment evaluation. This has led to a significant reduction in manual annotation efforts and enhanced the overall efficiency of their A/B testing process. The chosen LLM, XLM-RoBERTa-large, offers a good balance of prediction quality and inference efficiency.
CdXz5zHNQW_fbv8G1VHoa.png
Pinterest uses a metric called prevalence to measure policy-violating content, defined as the percentage of all views that went to harmful content. Prevalence complements user reports by identifying under-reported harms and tracking trends. Historically, reliance on human review for measuring prevalence was slow and expensive. To address this, Pinterest developed an AI-assisted workflow for daily prevalence measurement. This involves sampling user impressions and using a multimodal LLM for large-scale labeling. The LLM, guided by expert prompts and subject matter experts, significantly reduces latency and cost while maintaining accuracy. Prevalence is calculated daily, with confidence intervals, and can be broken down by policy areas, sub-policies, and content surfaces. The system uses risk scores from enforcement models for efficient sampling, but these scores do not act as labels. Inverse-probability weighting ensures the prevalence statistic accurately reflects user impressions over time, even with enforcement threshold changes. Machine learning is crucial for unbiased sampling and efficient labeling, allowing for faster risk detection and proactive responses. This data-driven approach enables quicker product iterations, informed policy development, and strategic decision-making, including setting goals and allocating resources effectively. Challenges like wide confidence intervals for rare categories or policy drift are managed through adaptive sampling and continuous monitoring. Future plans include expanding pivoting capabilities, optimizing LLM usage, and refining human-in-the-loop processes for enhanced accuracy and reduced bias.
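To make the inverse-probability weighting step concrete, here is a minimal sketch of a self-normalized weighted estimate. It assumes each sampled impression carries the probability with which it was selected (for example, derived from a risk score) and a violation label from the LLM; the function name and input shape are hypothetical, and confidence intervals are left out.

```python
def weighted_prevalence(samples):
    """Estimate the share of impressions that went to violating content.

    `samples` is an iterable of (sampling_probability, is_violating) pairs.
    Each sample is weighted by 1 / sampling_probability so that impressions
    that were over-sampled (e.g. high-risk ones) do not inflate the estimate.
    """
    total_weight = 0.0
    violating_weight = 0.0
    for probability, is_violating in samples:
        weight = 1.0 / probability
        total_weight += weight
        if is_violating:
            violating_weight += weight
    return violating_weight / total_weight
```

With one violating impression sampled at probability 0.5 and one benign impression sampled at probability 0.1, the unweighted rate would be 50%, while the weighted estimate is 2 / 12, roughly 16.7%, because the benign impression stands in for ten impressions and the violating one for two.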
Android end-to-end testing builds at Pinterest were slow and unreliable due to unbalanced test shards and platform limitations. The team first evaluated third-party solutions but found them inadequate for their needs. They decided to build an in-house testing platform called PinTestLab, hosted on EC2 emulators. This platform allowed for complete control over the testing stack and infrastructure.
The core innovation is a runtime-aware sharding mechanism. This system uses historical test duration and stability data to pack tests into shards. The goal is to ensure that each shard has a similar total runtime. This approach differs from simply balancing the number of tests per shard.
Previously, package-based sharding led to imbalances where a single slow shard would delay the entire build. Even simple time-based sorting failed to account for emulator idle time. The new runtime-aware sharding algorithm works by sorting tests by average runtime and then greedily assigning each test to the emulator projected to finish earliest. This keeps all emulators busy and minimizes the time difference between the fastest and slowest shards.
The impact of this solution has been significant. End-to-end build times were reduced by nine minutes, a 36% improvement. The runtime of the slowest shard decreased by 55%. The time difference between the fastest and slowest shards was dramatically compressed from 597 seconds to just 130 seconds. This boosts developer velocity by providing faster and more reliable feedback.
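The greedy assignment described above can be sketched with a min-heap keyed on each shard's projected finish time. The function name and inputs are hypothetical, and sorting longest-first is an assumption (the summary only says tests are sorted by average runtime); stability data is ignored here.

```python
import heapq


def shard_tests(avg_runtime, num_shards):
    """Pack tests into shards so that total runtimes end up roughly equal.

    `avg_runtime` maps test name -> average historical runtime in seconds.
    Returns (shards, totals): the test names per shard and each shard's
    projected total runtime.
    """
    heap = [(0.0, index) for index in range(num_shards)]  # (projected finish, shard)
    shards = [[] for _ in range(num_shards)]

    # Assumed ordering: longest tests first, so short tests fill the gaps.
    ordered = sorted(avg_runtime.items(), key=lambda item: item[1], reverse=True)
    for name, seconds in ordered:
        load, index = heapq.heappop(heap)  # shard projected to finish earliest
        shards[index].append(name)
        heapq.heappush(heap, (load + seconds, index))

    totals = [0.0] * num_shards
    for load, index in heap:
        totals[index] = load
    return shards, totals
```

For five tests of 90, 60, 50, 40, and 20 seconds on two shards, this yields two shards of 130 seconds each, whereas splitting by test count alone could leave one shard far longer than the other.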
CdXz5zHNQW_7VB873V6rz.png
Pinterest's ML training platform, MLEnv, encountered a significant performance drop after a PyTorch version upgrade. This issue led to a more than 50% reduction in training throughput. The debugging process began by examining the GPU roofline throughput. This measurement revealed a 20% performance decrease even when excluding the data loader. Further analysis focused on individual model modules to pinpoint the source of the slowdown. A specific transformer module, module A, was identified as the primary culprit. The PyTorch profiler showed that CompiledFunctions, previously present, were now missing for this module in the upgraded version.
Investigation into torch.compile revealed a log indicating that a non-infrastructure PyTorch dispatch mode was present, which torch.compile did not support. Minimal reproducible scripts confirmed that this issue manifested specifically within the trainer class. The problematic component was identified as a context manager used for FLOPs counting, enabled by default. Disabling this context manager resolved the torch.compile issue, restoring CompiledFunctions. However, this fix did not improve end-to-end throughput.
The focus shifted back to the data loading and distributed training aspects, ruling out Ray.data as the cause by observing the same GPU roofline throughput issues even when running as a native PyTorch application. Several observations pointed to intermittent slow iterations, a straggler effect during synchronization, and a peculiar behavior where enabling Nvidia's Nsight Systems profiler eliminated the slowness. Testing on a single GPU confirmed distributed training was not the root cause. Disabling torch.compile entirely in the Ray setup restored original throughput, suggesting that graph breaks within torch.compile were related to the slowdowns.
Creating a minimal reproducible model with extensive graph breaks led to the observation of recurring slow iterations.
Nsight Systems traces revealed that the main training thread was holding the Global Interpreter Lock (GIL) during these slow iterations, but this did not explain the entire pause. Further analysis using the Linux perf tool and visualizing the traces with chrome://tracing highlighted a suspicious Python process. This process was executing an expensive computation, specifically a Linux kernel call named smap_gather_stats, which gathers virtual memory statistics.
CdXz5zHNQW_ahqFK2Jga1.png
CdXz5zHNQW_JMlsyqEFEB.png
CdXz5zHNQW_dc6w46JhEJ.png
Pinterest's Data Engineering team is building a new massive scale data processing platform to replace their current Hadoop-based platform, Monarch. The team explored Kubernetes-based systems as a replacement due to their growing popularity and increasing adoption in the Big Data community. The new platform had to meet certain criteria, including extensive support for containers, execution of Pinterest's custom Spark fork, and lower operational and maintenance costs. The team performed a comprehensive evaluation of running Spark on various platforms and leaned towards Kubernetes-focused frameworks due to their advantages, including container-based isolation and security, ease of deployment, and built-in frameworks. Kubernetes provides more fine-grained support for container management and deployment than other systems, but lacks built-in support for data management, storage, and processing. The team's current deployment model in Hadoop is cumbersome, and they are moving towards a more straightforward approach using Terraform, container images, and Helm. The new platform will leverage Kubernetes and EKS to replace Monarch, introducing several challenges, including integrating EKS into the existing Pinterest environment and finding replacements for Hadoop components. The team has built a new platform, Moka, which is able to process batch Spark workloads that only access non-sensitive data, and will add more functionality in the future. The initial high-level design of Moka includes a system that can process batch Spark workloads, with jobs submitted and processed through a series of components, including Spinner, Archer, and the Spark Operator. The team will provide more details on the core application-focused aspects of their platform in the next part of their blog series.
CdXz5zHNQW_bbfzbQhJcm.png
CdXz5zHNQW_vBYaC4X7rO.png
CdXz5zHNQW_OYKi1HZH8r.png
CdXz5zHNQW_kxoAAaP5LS.jpeg
CdXz5zHNQW_I5dnAJn3pO.png
Pinterest is a unique platform where users, known as Pinners, come to find inspiration and ideas for various aspects of their lives. The platform's goal is to provide a personalized experience, showing users content that is relevant to their interests and searches. Pinterest's approach to personalization is different from other platforms, as it prioritizes quality time over time spent on the platform. The company believes that a balance between different approaches to content ranking is necessary, incorporating explicit engagement signals, community guidelines, and survey-based personalization. Pinterest uses surveys to gather feedback from users and create a healthier and more inspirational experience. The platform's surveys are designed to be rigorous and effective, with a team of experts ensuring that the surveys are well-designed and useful. The surveys have been instrumental in helping Pinterest create a positive and inspirational experience for users, with recent research showing that the platform leads the industry in terms of its impact on user wellbeing. Pinterest's approach to personalization is guided by the principles of the Inspired Internet Pledge, which calls for companies to prioritize user wellbeing and create a healthier internet experience. By using surveys and prioritizing user wellbeing, Pinterest is proving that it is possible to create a safer and healthier online experience. Overall, Pinterest's unique approach to personalization and its commitment to user wellbeing set it apart from other social media platforms.
CdXz5zHNQW_xMjUEWeEAQ.png
CdXz5zHNQW_P7QrX8S0r6.png
CdXz5zHNQW_YKhzxUkvad.png
CdXz5zHNQW_HsQvhLZ0Fu.png
CdXz5zHNQW_Eh609KFgJk.png
CdXz5zHNQW_u2LVnNpu5X.png
Pinterest, a visual search engine, runs on AWS and uses Amazon EC2 instances for its compute fleet. The company identified a significant challenge in managing its EC2 infrastructure, particularly for its online storage systems, due to a lack of clear insights into EC2's network performance and its impact on application reliability and performance. To address this, Pinterest developed network performance monitoring for its EC2 fleet and implemented techniques to manage network bursts, ensuring dependable network performance for critical online serving workloads. The company experienced issues with user sequence serving, which drove significant user engagement wins but resulted in serving latency and application timeouts. During an EC2 instance migration, Pinterest saw significant performance degradation across many clusters, leading to application timeouts. The company discovered that EC2 instances were experiencing network throttling due to microbursts that exceeded the network allowance. To make EC2 network throttling behavior more transparent, Pinterest upgraded its instances to access raw counters on an EC2 instance using tools like ethtool. The company modified its internal metrics collection agent to scrape these counters and ingest them into its metrics storage. By rolling out these ENA metrics to its entire EC2 fleet, Pinterest gained unprecedented visibility into AWS traffic shaping and implemented various optimizations to mitigate network throttling. The company also explored techniques to handle network bursts, including fine-grained S3 rate limiting, data backup tuning, and network compression.
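As a rough illustration of scraping these counters, the sketch below parses the text printed by `ethtool -S <interface>` and extracts the allowance-exceeded counters that the ENA driver exposes. The parsing function is hypothetical and is not Pinterest's metrics agent; the counter names are those documented by AWS for the ENA driver.

```python
import re

# ENA driver counters that increment when traffic shaping queues or drops packets.
ALLOWANCE_COUNTERS = (
    "bw_in_allowance_exceeded",
    "bw_out_allowance_exceeded",
    "pps_allowance_exceeded",
    "conntrack_allowance_exceeded",
    "linklocal_allowance_exceeded",
)

_STAT_LINE = re.compile(r"\s*(\w+):\s*(\d+)\s*$")


def parse_ena_allowance_counters(ethtool_output):
    """Extract allowance-exceeded counters from `ethtool -S` text output."""
    counters = {}
    for line in ethtool_output.splitlines():
        match = _STAT_LINE.match(line)
        if match and match.group(1) in ALLOWANCE_COUNTERS:
            counters[match.group(1)] = int(match.group(2))
    return counters
```

The counters are cumulative, so a collection agent would typically report the difference between successive scrapes; a non-zero delta in `bw_in_allowance_exceeded` or `pps_allowance_exceeded` indicates that a burst exceeded the instance's network allowance during that interval.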
CdXz5zHNQW_DGfWhUSLvs.jpeg
Pinterest Search is a key surface where users can discover inspiring content that aligns with their information needs, and search relevance measures how well the search results align with the search query. To improve the search relevance model, a 5-level guideline is used to measure the relevance between queries and Pins. A cross-encoder language model takes the query together with the Pin's text and predicts the Pin's relevance to the query, and the task is formulated as a multiclass classification problem. The model is fine-tuned using human-annotated data, minimizing cross-entropy loss.
To represent each Pin, a varied set of text features is used, including Pin titles and descriptions, synthetic image captions, high-engagement query tokens, user-curated board titles, and link titles and descriptions. However, the cross-encoder LLM-based classifier is hard to scale for Pinterest Search due to real-time latency and cost considerations. Therefore, knowledge distillation is used to distill the LLM-based teacher model into a lightweight student relevance model.
The student model uses query-level features, Pin-level features, and query-Pin interaction features to predict 5-scale relevance scores. Knowledge distillation and semi-supervised learning are employed to train the student model, which makes effective use of vast amounts of initially unlabeled data and expands the data to a wide range of languages from around the world.
Offline experiments demonstrate the effectiveness of each modeling decision, including the comparison of language models, the importance of enriching text features, and scaling up training labels through distillation.
Online results show a +2.18% improvement in search feed relevance, as measured by nDCG@20, and a significant uptick in search fulfillment rates globally.
The proposed relevance modeling pipeline effectively generalizes across languages not encountered during training, and the multilingual LLM-based relevance teacher model generalizes across unseen languages. Future work will explore the integration of servable LLMs, vision-and-language multimodal models, and active learning strategies to dynamically scale and improve the quality of the training data.
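For reference, nDCG@K compares the discounted gain of a ranked list against the best possible ordering of the same items. The sketch below uses the common exponential gain, 2^rel - 1, with a log2 position discount; whether Pinterest uses this gain or a linear one is not stated, so treat that choice and the function names as assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` would hold the 5-level relevance labels of the returned Pins in ranked order, and `ndcg_at_k(relevances, 20)` corresponds to the nDCG@20 metric quoted above.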
CdXz5zHNQW_lGLj8VappE.png