The author, Roman Isachenko, a member of the Computer Vision team at Yandex, discusses the development of a multimodal image search engine using Visual Language Models (VLMs). VLMs are a new frontier in computer vision that can solve a variety of fundamental CV tasks in zero-shot and one-shot modes. The author explains the basics of VLMs, the process of training one for multimodal image search, and the design principles, challenges, and architecture that make it all possible.
VLMs typically have three main components: a text model (LLM), an image model (a CNN or Vision Transformer), and an adapter that acts as a mediator between the two. The adapter is the most interesting and important part of the model, since it translates image representations into a form the LLM can consume. There are two types of adapters: prompt-based adapters, which project image features into the LLM's embedding space and prepend them to the input as "soft" tokens, and cross-attention-based adapters, which let the LLM attend to image features through dedicated attention layers.
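To make the prompt-based variant concrete, here is a minimal sketch (not the author's implementation) of the core idea: a learned linear projection maps image-encoder features into the LLM's token-embedding space, and the projected vectors are prepended to the text embeddings as extra tokens. All dimensions and values are illustrative.

```python
import random

def linear_project(features, weights):
    """Multiply a feature vector (len d_img) by a weight matrix (d_llm x d_img)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

d_img, d_llm = 4, 6  # illustrative dimensions
random.seed(0)
# In a real adapter these weights are learned during pre-training.
weights = [[random.uniform(-1, 1) for _ in range(d_img)] for _ in range(d_llm)]

image_features = [0.2, -0.5, 0.1, 0.9]                     # from the image encoder
visual_tokens = [linear_project(image_features, weights)]  # the "soft prompt"
text_embeddings = [[0.0] * d_llm, [1.0] * d_llm]           # embedded text tokens

# The LLM consumes visual tokens followed by text tokens as one sequence.
llm_input = visual_tokens + text_embeddings
print(len(llm_input), len(llm_input[0]))  # 3 tokens, each of width d_llm
```

A cross-attention-based adapter would instead leave the LLM's input sequence purely textual and feed the image features in through attention layers inserted into the LLM itself.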
The author discusses the training process of VLMs, which involves two stages: pre-training and alignment. Pre-training links the text and image modalities together and loads world knowledge into the model. Three kinds of data are used in VLM pre-training: interleaved image-text documents, image-text pairs, and instruct-based data.
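A common way to organize this two-stage recipe (a hedged sketch of typical practice, not necessarily the author's exact setup) is to train mostly the adapter during pre-training, while the LLM is also updated during alignment. The frozen/trainable split below is an assumption for illustration.

```python
# Assumed two-stage VLM training recipe; which components are frozen
# at each stage varies between papers and is an illustrative choice here.
STAGES = {
    "pretrain": {
        "data": ["interleaved", "image-text pairs", "instruct-based"],
        "trainable": {"adapter": True, "llm": False, "image_encoder": False},
    },
    "alignment": {
        "data": ["instruct-based"],
        "trainable": {"adapter": True, "llm": True, "image_encoder": False},
    },
}

def trainable_components(stage):
    """Return the component names updated by the optimizer at a given stage."""
    return [name for name, on in STAGES[stage]["trainable"].items() if on]

print(trainable_components("pretrain"))   # ['adapter']
print(trainable_components("alignment"))  # ['adapter', 'llm']
```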
The author also discusses methods for evaluating the quality of VLMs: calculating metrics on open-source benchmarks and comparing models in side-by-side (SBS) evaluations. The author's model is bilingual, responding in both English and Russian, which allows English open-source benchmarks to be used alongside SBS comparisons.
The author also shares the experience of adding multimodality to Neuro, an AI-powered search product, allowing users to ask questions using both text and images. The pipeline architecture of Neuro is discussed, and the author explains what the pipeline looked like before multimodality was added.
The author concludes that VLMs are a powerful tool for developing multimodal image search engines and that the future of compound AI systems lies in the development of VLMs. The author's team has achieved a 28% more accurate multimodal image search engine using VLMs, and the author believes that this is just the beginning of the development of VLMs.
towardsdatascience.com
