Web Bench: A New Benchmark for Evaluating Web Browsing Agents

2025-05-29

Web Bench is a new dataset for evaluating web browsing agents. It comprises 5,750 tasks across 452 websites, of which 2,454 tasks are open-sourced. The benchmark shows that existing agents struggle with write-heavy tasks (logging in, filling out forms, downloading files), which highlights the importance of browser infrastructure. Anthropic Sonnet 3.7 CUA achieved the highest performance. The research exposes the challenges of automating web interactions and paves the way for more robust AI agents.

Skyvern Browser Agent 2.0: Achieving State-of-the-Art in Web Automation

2025-01-17

Skyvern, an open-source no-code browser agent builder, has released version 2.0. The release achieves a state-of-the-art score of 85.85% on the WebVoyager benchmark by implementing a planner-actor-validator agent loop. This architecture breaks complex instructions into smaller, manageable tasks, and a validation step checks for successful completion. Skyvern 2.0 can handle complex prompts such as "Navigate to Amazon and add an iPhone 16, case, and screen protector to cart." The team also publicly released the full evaluation results, underscoring its commitment to open source.
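
To make the planner-actor-validator idea concrete, here is a minimal Python sketch of such a loop. It is not Skyvern's implementation: `run_agent`, `StepResult`, and the toy `toy_plan`, `make_flaky_actor`, and `toy_validate` helpers are hypothetical names invented for illustration, and the retry limit is an assumption. In a real agent the planner and validator would likely call a language model, and the actor would drive a browser.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    step: str       # the sub-task produced by the planner
    done: bool      # whether the validator accepted the outcome
    attempts: int   # how many times the actor tried this step


def run_agent(
    goal: str,
    plan: Callable[[str], List[str]],
    act: Callable[[str], str],
    validate: Callable[[str, str], bool],
    max_attempts: int = 3,  # assumed retry budget, not from the source
) -> List[StepResult]:
    """Plan the goal into steps, act on each, and validate before moving on."""
    results: List[StepResult] = []
    for step in plan(goal):
        attempts = 0
        done = False
        while not done and attempts < max_attempts:
            attempts += 1
            observation = act(step)            # actor performs the step
            done = validate(step, observation)  # validator checks the outcome
        results.append(StepResult(step, done, attempts))
        if not done:
            break  # stop early: later steps depend on this one
    return results


# --- Toy stand-ins so the sketch runs without a browser or a model ---

def toy_plan(goal: str) -> List[str]:
    # Hypothetical planner: split a comma-separated goal into sub-tasks.
    return [part.strip() for part in goal.split(",") if part.strip()]


def make_flaky_actor() -> Callable[[str], str]:
    # Hypothetical actor: fails the first time it sees a step, then succeeds.
    seen = set()

    def act(step: str) -> str:
        if step not in seen:
            seen.add(step)
            return "error"
        return f"ok: {step}"

    return act


def toy_validate(step: str, observation: str) -> bool:
    # Hypothetical validator: accept only the expected success message.
    return observation == f"ok: {step}"


if __name__ == "__main__":
    goal = "open store, add phone to cart, add case to cart"
    for result in run_agent(goal, toy_plan, make_flaky_actor(), toy_validate):
        print(result)
```

The design point the sketch tries to capture is that validation happens per step, so a failed action is retried or halts the run rather than silently corrupting later steps.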
