Towards Data Science | Medium

Can AI Agents Do Your Day-to-Day Tasks on Apps?

Benchmarking coding agents on personal, day-to-day app tasks requires a rich execution environment, complex tasks, and a reliable evaluation framework, and existing benchmarks fall short on these criteria. AppWorld addresses this with a simulated world in which coding agents interact with everyday apps through APIs on behalf of people. The benchmark comprises 750 day-to-day tasks that demand rich, interactive code generation.

A robust evaluation framework judges an agent by the state changes it produces in the apps' underlying databases, checking both that the task was completed and that no collateral damage was done. Despite the capabilities of current LLMs, GPT-4o completes only about 30% of the challenge tasks correctly, and scores drop further as task difficulty increases, underscoring how demanding the benchmark is.

AppWorld lays a foundation for future research on automating digital tasks, including extending the benchmark to multi-agent collaboration, overlaying it with a physical-world engine, and studying the risks of digital assistants operating autonomously. The framework is easy to set up, and researchers are encouraged to explore its possibilities.
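The state-based evaluation idea can be illustrated with a small, self-contained sketch. This is not the actual AppWorld API; all class and function names below are hypothetical, and the check simply compares database snapshots taken before and after the agent's code runs, passing only if the required changes are present and nothing else was touched.

```python
# Illustrative sketch of state-based evaluation (hypothetical names, not the AppWorld API).
import copy
from dataclasses import dataclass, field


@dataclass
class SimulatedWorld:
    """Toy stand-in for an app environment backed by a database."""
    db: dict = field(default_factory=dict)

    def execute(self, code: str) -> None:
        # In a real environment this would run agent-generated code that calls
        # app APIs; here the code manipulates the database dict directly.
        exec(code, {"db": self.db})

    def snapshot(self) -> dict:
        # Deep copy so later mutations by the agent do not alter the snapshot.
        return copy.deepcopy(self.db)


def state_based_check(before: dict, after: dict, expected: dict) -> bool:
    """Pass only if every required change happened and nothing else changed."""
    # 1) Task completion: all expected values must appear in the final state.
    if any(after.get(key) != value for key, value in expected.items()):
        return False
    # 2) No collateral damage: every other key must be identical before/after.
    other_keys = (set(before) | set(after)) - set(expected)
    return all(before.get(key) == after.get(key) for key in other_keys)


world = SimulatedWorld(db={"playlist": [], "balance": 100})
before = world.snapshot()
world.execute("db['playlist'] = ['song_42']")  # stands in for agent-generated code
after = world.snapshot()
print(state_based_check(before, after, expected={"playlist": ["song_42"]}))  # True
```

If the agent's code had also modified `balance`, the check would fail even though the playlist change succeeded, which is the "absence of collateral damage" criterion described above.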