
Smarter URL Normalization at Scale: How MIQPS Powers Content Deduplication at Pinterest

Pinterest uses content understanding to drive distribution and engagement, requiring insight into images and outbound links. The core problem is URL normalization, where identical product pages appear under varied URLs due to tracking parameters. This redundancy leads to wasted computational resources through repeated fetching and processing. Item canonicalization aims to unify identical items represented by different URLs, which is crucial for shopping catalogs. When item IDs are absent, advanced URL normalization is vital for deduplication.

The Minimal Important Query Param Set (MIQPS) algorithm automatically learns which URL parameters influence content identity. It distinguishes between neutral parameters, which don't affect page content, and non-neutral parameters, which do. While static rules work for well-known platforms, Pinterest's vast domain set requires a dynamic, data-driven approach.

The MIQPS algorithm operates in three steps. First, it collects a corpus of observed URLs per domain from Pinterest's ingestion pipeline. Second, URLs are grouped by their query parameter pattern, ensuring parameters are analyzed in their specific context. This prevents misclassifying a parameter based on a different URL type.

Finally, for each parameter within a pattern, the algorithm empirically tests its importance. It samples URLs with distinct parameter values and computes content IDs for both the original and modified (parameter-removed) URLs. If removing the parameter changes the content ID in a significant percentage of samples, it's classified as non-neutral and retained. Otherwise, it's deemed neutral and can be safely stripped for normalization. Each merchant domain receives its own MIQPS map, accounting for domain-specific parameter meanings.
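The three steps described above can be sketched in Python for a single domain's URL corpus. This is a minimal illustration, not Pinterest's implementation: the function names (`param_pattern`, `drop_param`, `learn_miqps`, `normalize`), the `threshold` and `max_samples` defaults, and the caller-supplied `content_id` function are all assumptions made for the example. In a real pipeline, `content_id` would fetch and fingerprint the page; here it is simply any callable that maps a URL to a comparable value.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def param_pattern(url):
    """Return the sorted tuple of query parameter names present in the URL."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return tuple(sorted({k for k, _ in pairs}))


def drop_param(url, name):
    """Return the URL with every occurrence of one query parameter removed."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def learn_miqps(urls, content_id, threshold=0.5, max_samples=20):
    """Map each query-parameter pattern to the parameters worth keeping.

    threshold and max_samples are hypothetical knobs; the source only says
    "a significant percentage of samples".
    """
    # Step 2: group the corpus by query-parameter pattern.
    groups = defaultdict(list)
    for url in urls:
        groups[param_pattern(url)].append(url)

    miqps = {}
    for pattern, members in groups.items():
        important = set()
        for name in pattern:
            # Step 3: sample URLs that carry distinct values of this parameter.
            seen, samples = set(), []
            for url in members:
                pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
                value = tuple(v for k, v in pairs if k == name)
                if value not in seen:
                    seen.add(value)
                    samples.append(url)
                if len(samples) >= max_samples:
                    break
            # Count samples whose content ID changes once the parameter is removed.
            changed = sum(
                content_id(u) != content_id(drop_param(u, name)) for u in samples
            )
            if changed / len(samples) >= threshold:
                important.add(name)  # non-neutral: retain
        miqps[pattern] = important
    return miqps


def normalize(url, miqps):
    """Strip neutral parameters; leave URLs with an unseen pattern untouched."""
    keep = miqps.get(param_pattern(url))
    if keep is None:
        return url
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((k, v) for k, v in pairs if k in keep)
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Calling `learn_miqps` once per merchant domain would yield the per-domain map the text describes. Leaving URLs with an unseen pattern unchanged is a conservative choice made for this sketch: it avoids stripping a parameter that was never tested in that context.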