Web Bench: A New Benchmark for Evaluating Web Browsing Agents

2025-05-29

Web Bench is a new dataset for evaluating web browsing agents. It comprises 5,750 tasks across 452 websites, of which 2,454 tasks are open-sourced. The benchmark shows that existing agents struggle with write-heavy tasks (logging in, filling out forms, downloading files), which highlights the importance of browser infrastructure. Among the agents evaluated, Anthropic's Sonnet 3.7 CUA achieved the highest performance. The results expose the challenges of automating web interactions and point the way toward more robust AI agents.