arxiv:2407.18901

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Published on Jul 26
· Submitted by akhaliq on Jul 29
#2 Paper of the day

Abstract

Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, shopping app) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.
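
To make the idea of state-based evaluation concrete, here is a minimal Python sketch. It is not AppWorld's actual test harness or API: the `State` layout, `diff_states`, `evaluate`, and the example tables are all hypothetical names chosen for illustration. The sketch only shows the general pattern the abstract describes, which is to judge the final state rather than the sequence of API calls, and to flag any change outside the expected scope as collateral damage.

```python
from typing import Any

# Hypothetical state layout: table name -> {row id -> row contents}.
State = dict[str, dict[Any, Any]]

_MISSING = object()  # sentinel so an absent row differs from a row equal to None


def diff_states(before: State, after: State) -> set[tuple[str, Any]]:
    """Return (table, row_id) pairs that were added, removed, or modified."""
    changed = set()
    for table in before.keys() | after.keys():
        old_rows = before.get(table, {})
        new_rows = after.get(table, {})
        for row_id in old_rows.keys() | new_rows.keys():
            if old_rows.get(row_id, _MISSING) != new_rows.get(row_id, _MISSING):
                changed.add((table, row_id))
    return changed


def evaluate(before: State, after: State,
             required: dict[tuple[str, Any], Any],
             allowed_tables: set[str]) -> dict[str, Any]:
    """Check the final state only: required rows must be present, and no
    table outside `allowed_tables` may have changed (collateral damage)."""
    changed = diff_states(before, after)
    missing = {
        key for key, row in required.items()
        if after.get(key[0], {}).get(key[1], _MISSING) != row
    }
    collateral = {(t, r) for (t, r) in changed if t not in allowed_tables}
    return {"passed": not missing and not collateral,
            "missing": missing, "collateral": collateral}


# Illustrative task: "order milk". Any call sequence is acceptable as long as
# the order exists afterwards and nothing unrelated was touched.
before = {"orders": {}, "notes": {1: {"text": "buy milk"}}}
after_good = {"orders": {1: {"item": "milk"}}, "notes": {1: {"text": "buy milk"}}}
after_bad = {"orders": {1: {"item": "milk"}}, "notes": {}}  # note was deleted
required = {("orders", 1): {"item": "milk"}}

good_result = evaluate(before, after_good, required, {"orders"})
bad_result = evaluate(before, after_bad, required, {"orders"})
```

In this sketch `good_result` passes, while `bad_result` fails because the deleted note shows up as collateral damage even though the order itself was placed correctly.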

Community

Paper submitter

Thanks, @akhaliq , for posting! Here are all associated links:

🔗 Website: https://appworld.dev
🧭 Data (task, trajectories) explorer, playground: https://appworld.dev/task-explorer
🔍 API explorer: https://appworld.dev/api-explorer
📊 Leaderboard: https://appworld.dev/leaderboard
🐦 X (Twitter): https://x.com/harsh3vedi/status/1818311843976233198
💬 Blog: https://appworld.dev/blog
🎬 (TLDR) Video: https://appworld.dev/video
🌎 Code: https://github.com/stonybrooknlp/appworld
🐍 PyPI: https://pypi.org/project/appworld/

