Web Bench: A New Benchmark for Evaluating Web Browsing Agents

2025-05-29

Web Bench is a new dataset for evaluating web browsing agents. It comprises 5,750 tasks across 452 websites, of which 2,454 tasks are open-sourced. The benchmark shows that existing agents struggle with write-heavy tasks (logging in, filling out forms, downloading files), which highlights the importance of browser infrastructure. Anthropic's Sonnet 3.7 CUA achieved the highest performance. The results expose the challenges of automating web interactions and point toward more robust AI agents.

(blog.skyvern.com)

AI web browsing agents
