Putnam-AXIOM: A New Benchmark Exposes the Limits of LLM Mathematical Reasoning
2025-01-01
Researchers introduced Putnam-AXIOM, a challenging benchmark of 236 problems from the William Lowell Putnam Mathematical Competition, designed to evaluate the higher-level mathematical reasoning capabilities of Large Language Models (LLMs). To mitigate data contamination, they also created a variation benchmark by functionally altering 52 of the problems, changing variable names and constant values so that each variant is novel to the model yet comparable in difficulty. Even top-performing models suffer an accuracy drop of roughly 30% on the variations relative to the originals, suggesting that scores on the well-known original problems are partly inflated by memorization and that substantial room for improvement remains in LLM mathematical reasoning.
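A minimal sketch of what a "functional alteration" might look like in practice (this is a hypothetical template and helper of my own, not the Putnam-AXIOM pipeline): the logical structure of a problem is held fixed while its constants are re-sampled, so the exact problem string cannot have appeared in training data, and the ground-truth answer is recomputed symbolically alongside the statement.

```python
import random
from dataclasses import dataclass

# Hypothetical illustration of a functional variation: the problem's
# structure is fixed, constants are re-sampled, and the answer is
# recomputed so each variant remains automatically gradable.

@dataclass
class Variant:
    statement: str
    answer: str

def make_variant(rng: random.Random) -> Variant:
    # Template: minimize f(x) = x^2 + b*x + c over the reals.
    # The minimum, attained at x = -b/2, equals c - b^2/4.
    b = rng.choice([2, 4, 6, 8])   # even b keeps the answer an integer
    c = rng.randint(1, 20)
    statement = (
        f"Find the minimum value of f(x) = x^2 + {b}x + {c} "
        f"over all real x."
    )
    answer = str(c - b * b // 4)   # exact integer arithmetic since b is even
    return Variant(statement, answer)

if __name__ == "__main__":
    rng = random.Random(0)         # seeded for reproducible variants
    for _ in range(3):
        v = make_variant(rng)
        print(v.statement, "->", v.answer)
```

Because the constants can be re-sampled indefinitely, this style of templating yields an effectively unbounded supply of unseen, automatically checkable problems, which is what lets the variation set separate genuine reasoning from recall of memorized solutions.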