Building Pinterest Canvas, a text-to-image foundation model
Pinterest Canvas is a text-to-image model trained on more than 1.5 billion high-quality text-image pairs to generate visually appealing images. The base model is fine-tuned in a two-stage inpainting process to generate photorealistic backgrounds for products while preserving object boundaries. Conditioning images guide the generation process, and Unified Visual Embedding (UVE) proved particularly effective at influencing the outputs. The model is also extended with IP-Adapter to accept additional image prompts, which lets it generate backgrounds in specific visual styles. Future improvements include upgrading to a Transformer-based diffusion architecture, exploring soft-masking approaches, and incorporating Pinterest-optimized visual embeddings for better text conditioning. Pinterest Canvas makes it possible to visualize products in new contexts and to enhance existing images and products on the platform.
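
To make the idea of background inpainting with preserved object boundaries concrete, here is a minimal sketch of mask-based compositing, the generic operation that keeps product pixels fixed while the background is replaced. It is an illustration under stated assumptions, not Pinterest's implementation: the `composite` function, the `Image` type alias, and the toy grayscale data are all hypothetical, and a real system would work on model outputs rather than hand-written arrays. A mask with values strictly between 0 and 1 gives a simple form of soft masking.

```python
from typing import List

# Hypothetical toy representation: a grayscale image as rows of intensities in [0, 1].
Image = List[List[float]]


def composite(original: Image, generated: Image, mask: Image) -> Image:
    """Blend a generated background with the original product pixels.

    mask[y][x] == 1.0 means "use the generated pixel" (background),
    mask[y][x] == 0.0 means "keep the original pixel" (product).
    Values in between blend the two, which softens the boundary.
    """
    if not (len(original) == len(generated) == len(mask)):
        raise ValueError("original, generated, and mask must have the same height")
    out: Image = []
    for row_o, row_g, row_m in zip(original, generated, mask):
        if not (len(row_o) == len(row_g) == len(row_m)):
            raise ValueError("rows must have the same width")
        out.append([m * g + (1.0 - m) * o for o, g, m in zip(row_o, row_g, row_m)])
    return out


# Toy 2x3 example: the product occupies the middle column.
original = [[0.2, 0.9, 0.2], [0.2, 0.9, 0.2]]
generated = [[0.5, 0.1, 0.7], [0.6, 0.1, 0.8]]
hard_mask = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]

result = composite(original, generated, hard_mask)
```

With the hard mask, the middle column of `result` keeps the original product values and the outer columns take the generated background. Passing a mask with fractional values near the product edge would blend the two images there instead of switching abruptly.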