SWE-bench: Can LLMs Solve Real-World GitHub Issues?
2025-01-08
SWE-bench is a benchmark that evaluates whether large language models can automatically resolve real-world GitHub issues. The researchers compiled 2,294 issue-pull request pairs from 12 popular Python repositories; a model's proposed fix is validated by running the repository's unit tests. On the latest leaderboard, success rates vary widely across models, with some resolving more than 50% of issues. The project also provides a lite version of the benchmark and pre-trained models to make evaluation and reproduction easier.
Development
Code Repair
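
A minimal sketch of how a resolution rate like the one in the summary could be computed from unit-test outcomes. It assumes the usual SWE-bench convention that an issue counts as resolved only when the tests that failed before the fix now pass (fail-to-pass) and the tests that already passed still pass (pass-to-pass). The class, function, and instance names are hypothetical and are not the official evaluation harness API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceResult:
    """Test outcomes for one issue after applying a model-generated patch.

    Hypothetical container for illustration; not part of the SWE-bench codebase.
    """
    instance_id: str
    fail_to_pass: Dict[str, bool]  # test name -> passed after the patch?
    pass_to_pass: Dict[str, bool]  # test name -> still passing after the patch?


def is_resolved(result: InstanceResult) -> bool:
    """An issue is resolved only if every targeted test now passes
    and no previously passing test has been broken."""
    if not result.fail_to_pass:
        # Assumption: with no targeted tests there is nothing to verify.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolution_rate(results: List[InstanceResult]) -> float:
    """Fraction of issues resolved, in the range 0.0 to 1.0."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Hypothetical outcomes for three issues.
example_results = [
    InstanceResult("example__repo-1", {"test_bug": True}, {"test_other": True}),
    InstanceResult("example__repo-2", {"test_bug": False}, {"test_other": True}),
    InstanceResult("example__repo-3", {"test_bug": True}, {"test_other": False}),
]
print(f"Resolved: {resolution_rate(example_results):.1%}")
```

In this sketch the third issue does not count as resolved even though its targeted test passes, because the patch broke a test that passed before.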