VentureBeat

Google's 'Watch & Learn' framework cracks the data bottleneck for training computer-use agents

Google Cloud and DeepMind developed the Watch & Learn (W&L) framework to address the data bottleneck in training computer-use agents (CUAs). W&L automatically extracts training data from raw videos, eliminating the need for costly human annotation.

The framework uses an inverse dynamics objective to predict actions from consecutive observations. An inverse dynamics model (IDM) is trained on state transitions created by agents interacting with web pages. The trained IDM then generates high-quality trajectories by analyzing videos and identifying the UI actions performed in them. These trajectories can be used either to fine-tune existing models or to create in-context learning (ICL) examples for CUAs. The ICL examples are enhanced with reasoning annotations using models such as Gemini 2.5 Flash.

Experiments on the OSWorld benchmark showed improvements across model types, both fine-tuned and ICL-based. W&L achieved these results without manual annotation, demonstrating the viability of web-scale workflows for CUA development. The approach can convert existing video resources into training data, facilitating the creation of bespoke CUAs for businesses: users can record tasks and have the IDM annotate them for training. The framework's success highlights the potential for further advances in the CUA field.
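The core idea, labeling each consecutive pair of video frames with the action that connects them, can be sketched in a few lines. This is an illustrative toy, not Google's implementation: the `toy_idm` function, the `Step` dataclass, and the dictionary-based frame representation are all hypothetical stand-ins (a real IDM would be a trained vision model operating on raw screenshots).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    """One labeled transition: (observation before, inferred action, observation after)."""
    before: dict
    action: str
    after: dict

def extract_trajectory(frames: list[dict], idm: Callable[[dict, dict], str]) -> list[Step]:
    """Apply the inverse dynamics model to every consecutive frame pair,
    turning an unlabeled video into an action-annotated trajectory."""
    return [Step(a, idm(a, b), b) for a, b in zip(frames, frames[1:])]

def toy_idm(before: dict, after: dict) -> str:
    """Hypothetical rule-based stand-in for a trained IDM: infer the UI action
    from what changed between two observations."""
    if after["url"] != before["url"]:
        return f"navigate({after['url']})"
    if after["focused"] != before["focused"]:
        return f"click({after['focused']})"
    return "type_text"

# Three toy "frames" from a recorded session.
frames = [
    {"url": "a.com", "focused": None},
    {"url": "a.com", "focused": "search_box"},
    {"url": "b.com", "focused": None},
]
trajectory = extract_trajectory(frames, toy_idm)
# → actions: "click(search_box)", then "navigate(b.com)"
```

The resulting `(before, action, after)` triples are exactly the shape of data one would feed into supervised fine-tuning, or format as in-context examples for a CUA prompt.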