SWE-Bench Pro: A Challenging Benchmark for Evaluating LLMs on Software Engineering

2025-09-22

SWE-Bench Pro is a new benchmark for evaluating large language models (LLMs) and agents on long-horizon software engineering tasks. Given a codebase and an issue, the model must generate a patch that resolves the described problem. Inspired by SWE-Bench, the benchmark uses Docker and Modal for reproducible evaluations; to run the evaluation script, users need a working Docker environment and Modal credentials.
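To make the Docker-plus-Modal workflow concrete, here is a minimal sketch of how a containerized patch evaluation could be run remotely on Modal. This is not the official SWE-Bench Pro harness: the Dockerfile path, repository location at `/repo`, and `pytest` test command are all assumptions for illustration.

```python
# Illustrative sketch only -- NOT the official SWE-Bench Pro evaluation script.
# Assumes a per-task Dockerfile and a repo pre-cloned at /repo inside the image.
import modal

app = modal.App("swebench-pro-eval-sketch")

# Build the task environment from a per-task Dockerfile (path is hypothetical).
image = modal.Image.from_dockerfile("environments/example_task/Dockerfile")


@app.function(image=image, timeout=3600)
def evaluate_patch(patch: str) -> bool:
    """Apply a model-generated patch inside the task container and run its tests."""
    import subprocess

    # Write the patch to disk and apply it to the pre-cloned repository.
    with open("/tmp/model.patch", "w") as f:
        f.write(patch)
    subprocess.run(["git", "-C", "/repo", "apply", "/tmp/model.patch"], check=True)

    # Run the task's test suite; a zero exit code counts as "issue resolved".
    result = subprocess.run(["pytest", "-q"], cwd="/repo")
    return result.returncode == 0


@app.local_entrypoint()
def main(patch_file: str):
    with open(patch_file) as f:
        resolved = evaluate_patch.remote(f.read())
    print("resolved" if resolved else "not resolved")
```

With Modal credentials configured (for example via `modal token new`), a script like this would be launched with `modal run eval_sketch.py --patch-file model.patch`; the real evaluation script may expose a different interface.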

Development