Explosion of Papers on Benchmarking LLMs for Code Generation

2025-02-11

A flurry of recent arXiv preprints focuses on benchmarking large language models (LLMs) for code generation. These papers cover various aspects, including LLMs solving real-world GitHub issues, self-invoking code generation, API usage, stability analysis, and evaluation across the entire software development lifecycle. Researchers have introduced diverse benchmarks such as SWE-bench, HumanEval Pro, SEAL, and DevEval, along with corresponding metrics, aiming to evaluate LLM code generation capabilities more comprehensively and to drive progress in the field.
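To illustrate the kind of metric these benchmarks typically report, the sketch below computes the standard unbiased pass@k estimator (in the numerically stable form popularized by the original HumanEval work); the function name and the per-problem sample counts are hypothetical, shown only to make the calculation concrete.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0
    # Numerically stable form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical per-problem results: (num_samples, num_correct) pairs.
results = [(20, 5), (20, 0), (20, 18)]
score = float(np.mean([pass_at_k(n, c, k=1) for n, c in results]))
print(f"pass@1 = {score:.3f}")
```

Benchmarks that go beyond function-level synthesis, such as SWE-bench or DevEval, swap the unit of evaluation (a repository-level issue or a full development task) but generally still reduce each attempt to a pass/fail signal that feeds an aggregate score of this kind.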

Development