Putnam-AXIOM: A New Benchmark Exposes the Limits of LLM Mathematical Reasoning
2025-01-01
Researchers introduced Putnam-AXIOM, a challenging benchmark of 236 problems from the William Lowell Putnam Mathematical Competition, designed to evaluate the higher-level mathematical reasoning capabilities of Large Language Models (LLMs). To mitigate data contamination, they also created a variation benchmark by functionally altering 52 of the problems, changing variable names and constant values so that each variant is novel to the model yet comparable in difficulty. Even top-performing models suffer an accuracy drop of roughly 30% on the variations relative to the originals, suggesting that scores on the well-known original problems are partly inflated by memorization and that substantial room for improvement remains in LLM mathematical reasoning.
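A minimal sketch of what a "functional alteration" might look like in practice (this is a hypothetical template and helper of my own, not the Putnam-AXIOM pipeline): the logical structure of a problem is held fixed while its constants are re-sampled, so the exact problem string cannot have appeared in training data, and the ground-truth answer is recomputed symbolically alongside the statement.

```python
import random
from dataclasses import dataclass

# Hypothetical illustration of a functional variation: the problem's
# structure is fixed, constants are re-sampled, and the answer is
# recomputed so each variant remains automatically gradable.

@dataclass
class Variant:
    statement: str
    answer: str

def make_variant(rng: random.Random) -> Variant:
    # Template: minimize f(x) = x^2 + b*x + c over the reals.
    # The minimum, attained at x = -b/2, equals c - b^2/4.
    b = rng.choice([2, 4, 6, 8])   # even b keeps the answer an integer
    c = rng.randint(1, 20)
    statement = (
        f"Find the minimum value of f(x) = x^2 + {b}x + {c} "
        f"over all real x."
    )
    answer = str(c - b * b // 4)   # exact integer arithmetic since b is even
    return Variant(statement, answer)

if __name__ == "__main__":
    rng = random.Random(0)         # seeded for reproducible variants
    for _ in range(3):
        v = make_variant(rng)
        print(v.statement, "->", v.answer)
```

Because the constants can be re-sampled indefinitely, this style of templating yields an effectively unbounded supply of unseen, automatically checkable problems, which is what lets the variation set separate genuine reasoning from recall of memorized solutions.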