Benchmarking coding agents on personal, day-to-day apps requires a rich execution environment, complex tasks, and a reliable evaluation framework, and existing benchmarks fall short on all three. AppWorld addresses this with a simulated world in which coding agents interact with everyday apps through their APIs on behalf of simulated people. The benchmark comprises 750 day-to-day tasks that demand rich, interactive coding.

The evaluation framework judges an agent by the database state changes its actions produce, checking both that the task was completed and that no collateral damage was done along the way. Despite the general capability of today's LLMs, GPT-4o completes only about 30% of the challenge tasks correctly, and scores drop further as task difficulty increases.

AppWorld lays a foundation for future research on automating digital tasks, including extending the benchmark to multi-agent collaboration, overlaying it with a physical-world engine, and studying the risks of digital assistants that operate autonomously. The framework is easy to set up, and researchers are encouraged to build on it.
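To make the interaction model concrete, here is a minimal sketch of driving a single AppWorld task from Python, loosely following the project's published quick-start. The names `AppWorld`, `load_task_ids`, `world.execute`, `world.evaluate`, the `"train"` split label, and the `apis.supervisor.show_account_passwords()` call are taken on the assumption that they match the current release; verify against the AppWorld documentation before relying on them.

```python
# Sketch of one agent-environment round trip in AppWorld (names assumed
# from the project's quick-start; check the docs for exact signatures).
from appworld import AppWorld, load_task_ids

# Load the IDs of the tasks in a dataset split ("train" is an assumed label).
task_ids = load_task_ids("train")

with AppWorld(task_id=task_ids[0], experiment_name="minimal_agent") as world:
    # The natural-language instruction the agent must fulfill.
    print(world.task.instruction)

    # The agent acts by executing Python inside the simulated world, calling
    # app endpoints via the `apis.<app>.<endpoint>(...)` convention.
    output = world.execute("print(apis.supervisor.show_account_passwords())")
    print(output)

    # Evaluation inspects the resulting database state changes, checking both
    # task completion and the absence of collateral damage.
    print(world.evaluate())
```

An agent loop would repeat the `world.execute(...)` step, feeding each execution's output back to the LLM until it decides the task is done; the state-diff evaluation is what lets AppWorld penalize solutions that "work" but corrupt unrelated data.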
