Efficient Visual Representation Learning And Evaluation
Etsy uses computer vision to power features such as visual search and visually similar recommendations. These features depend on visual representations that are both efficient and expressive, produced by machine learning models. Etsy initially used EfficientNetB0 as its backbone and later switched to EfficientFormer-l3 because it delivers better performance with lower computational requirements. To further improve efficiency, Etsy fine-tunes these pre-trained backbones with multitask learning, training the representations on multiple classification tasks simultaneously. Evaluation relies on three nearest neighbor retrieval tasks, which track model progress and guide training. Etsy has also implemented an experimental evaluation scheme that uses generative AI to bridge the gap between text-based queries and clicked image candidates. For efficient inference in downstream tasks, Etsy uses a fast Stable Diffusion model that generates high-quality images with reduced memory consumption and latency. Together, these techniques make Etsy's visual representations efficient and scalable across a range of applications.
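To make the nearest neighbor retrieval evaluation more concrete, the following is a minimal sketch of how such a task could be scored. The source does not say which metric or similarity function Etsy uses, so recall@k, cosine similarity, the function names, and the toy embeddings and labels below are all illustrative assumptions rather than Etsy's actual setup.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_at_k(
    query_embeddings: List[Sequence[float]],
    query_labels: List[str],
    index_embeddings: List[Sequence[float]],
    index_labels: List[str],
    k: int,
) -> float:
    """Fraction of queries whose top-k neighbors contain a same-label item.

    Assumption: a retrieval counts as correct when any of the k nearest
    index items shares the query's label. This is brute-force search,
    suitable only for small illustrative data.
    """
    hits = 0
    for query, label in zip(query_embeddings, query_labels):
        # Rank every index item by similarity to the query, best first.
        ranked = sorted(
            range(len(index_embeddings)),
            key=lambda i: cosine_similarity(query, index_embeddings[i]),
            reverse=True,
        )
        if any(index_labels[i] == label for i in ranked[:k]):
            hits += 1
    return hits / len(query_embeddings)


# Hypothetical toy data: 2-D "embeddings" for four indexed listings.
index_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
index_labels = ["mug", "mug", "ring", "ring"]

# Two queries: an easy one near the "mug" cluster, and a hard one that
# is labeled "mug" but sits close to the "ring" cluster.
query_embeddings = [[0.95, 0.05], [0.2, 0.8]]
query_labels = ["mug", "mug"]

for k in (1, 3):
    score = recall_at_k(
        query_embeddings, query_labels, index_embeddings, index_labels, k
    )
    print(f"recall@{k} = {score:.2f}")
```

On this toy data the hard query only finds a matching label once k reaches 3, which illustrates why a metric like this can track representation quality: a better embedding would place that query closer to its own class. At production scale the brute-force ranking would typically be replaced by an approximate nearest neighbor index.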