From massive models to mobile magic: The tech behind YouTube real-time generative AI effects

YouTube Shorts aims to give creators real-time generative AI effects that run directly on mobile devices. This is achieved by distilling large AI models into smaller, task-specific ones that can run efficiently, frame by frame, on a phone. The process begins with curating a diverse, high-quality face dataset that is inclusive across demographics.

The key technique is knowledge distillation, which pairs a powerful "teacher" model with a lightweight "student" model. The teacher, initially StyleGAN2 and later models such as Imagen, performs the complex generation. The student, built on a UNet architecture with MobileNet components, is optimized for mobile. To train it, the teacher generates image pairs, and the student learns from those pairs using specific loss functions, with neural architecture search used to tune its design.

A critical challenge is preserving the user's identity, which is addressed with pivotal tuning inversion (PTI). PTI fine-tunes a generator to a specific face, so that edits can be made in latent space without altering the person's likeness.

On device, the solution uses Google's MediaPipe framework for face detection, alignment, and integration of the student model. The pipeline runs in real time, processing each frame in under 33 milliseconds (roughly 30 frames per second), which keeps the effect smooth. This technology has powered numerous popular YouTube Shorts features since 2023. The team continues to work on integrating newer models and reducing latency so that the effects reach a broader range of devices.
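
The teacher-student idea described above can be illustrated with a deliberately tiny sketch. It is not the YouTube pipeline: the `teacher` function here is a hypothetical stand-in for an expensive generative model, the student is a two-parameter linear model rather than a UNet, and mean squared error is assumed as the loss because the note does not name the losses actually used. The only point is the workflow: the teacher produces input/output pairs, and the student is fitted to those pairs.

```python
import random


def teacher(x):
    # Hypothetical stand-in for a large, slow generative model:
    # maps an input value to an "edited" output value.
    return 0.8 * x + 0.1 * x * x


def make_pairs(n, seed=0):
    # Step 1: use the teacher to generate (input, target) training pairs.
    rng = random.Random(seed)
    inputs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, teacher(x)) for x in inputs]


def mse(pairs, w, b):
    # Assumed loss: mean squared error between student and teacher outputs.
    return sum(((w * x + b) - y) ** 2 for x, y in pairs) / len(pairs)


def train_student(pairs, epochs=500, lr=0.5):
    # Step 2: fit a much smaller student (y = w * x + b) to the teacher's
    # outputs with plain full-batch gradient descent on the MSE loss.
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in pairs:
            err = (w * x + b) - y
            grad_w += 2.0 * err * x / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


pairs = make_pairs(256)
w, b = train_student(pairs)
loss_before = mse(pairs, 0.0, 0.0)
loss_after = mse(pairs, w, b)
print(f"student: w={w:.3f}, b={b:.3f}")
print(f"loss before={loss_before:.4f}, after={loss_after:.6f}")
```

The student cannot reproduce the teacher exactly, since it has no quadratic term, but it approximates it closely on the inputs it was trained on. That is the same trade-off distillation makes at scale: a small model that is good enough on one task and cheap enough to run per frame.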