LLM-Powered Relevance Assessment for Pinterest Search

Pinterest Search developed a method for evaluating search relevance with Large Language Models (LLMs). Relevance measurement traditionally relied on costly human annotations, which limited the scale and sensitivity of A/B experiments. To address this, the team fine-tuned open-source LLMs on human-labeled data to predict how relevant a Pin is to a query. The approach treats relevance prediction as a multiclass classification problem and uses features such as Pin titles, descriptions, and image captions.

The team adopted a stratified query sampling design, which reduces the Minimum Detectable Effect (MDE) by an order of magnitude. The new methodology also makes it possible to measure heterogeneous treatment effects and improves evaluation efficiency. Because LLM labeling is cheaper and faster than human annotation, sample sizes can be larger and more representative.

After fine-tuning, the LLM-based relevance model generates relevance scores, which are then used to compute metrics such as sDCG@K. Validation showed high alignment between LLM-generated labels and human annotations, with an exact match rate of 73.7% and strong rank-based correlations. This alignment holds across queries in different popularity segments.

LLM-based relevance assessment also proved effective for non-English queries, maintaining strong correlations and low bias. By moving to LLM-based relevance assessment, Pinterest Search has scaled up its evaluation query sets and improved the quality of the relevance metrics used to evaluate online experiments. This has substantially reduced manual annotation effort and made the A/B testing process more efficient. The chosen model, XLM-RoBERTa-large, offers a good balance of prediction quality and inference efficiency.
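The validation step described above compares LLM-generated labels with human annotations using an exact match rate and rank-based correlation. A minimal sketch of those two checks is shown here, assuming integer relevance labels and using Spearman's rank correlation as one example of a rank-based measure. The function names and the sample labels are hypothetical and are not taken from the Pinterest work, which does not specify its implementation.

```python
from statistics import mean


def exact_match_rate(human, llm):
    """Fraction of items where the LLM label equals the human label."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == m for h, m in zip(human, llm)) / len(human)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Hypothetical labels on an assumed integer relevance scale.
    human_labels = [1, 2, 3, 4, 5, 3]
    llm_labels = [1, 2, 3, 5, 5, 2]
    print("exact match rate:", exact_match_rate(human_labels, llm_labels))
    print("spearman:", spearman(human_labels, llm_labels))
```

In practice the same checks could be run separately per query popularity segment or per language to confirm that the alignment holds across segments, as the summary reports.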