LLMs Fall Short at IMO 2025: Medal-Level Performance Remains Elusive
Researchers evaluated five state-of-the-art large language models (LLMs) on the 2025 International Mathematical Olympiad (IMO) problems using the MathArena platform. Gemini 2.5 Pro performed best, scoring only 31% (13 of 42 points), well below the 19 points needed for a bronze medal, and the other models lagged far behind. A best-of-32 selection strategy, in which multiple candidate responses were generated per problem and the strongest was selected, substantially increased computational cost. Even so, the results demonstrate a wide gap between current LLMs and medal-level performance on problems as challenging as the IMO's. Qualitative analysis revealed issues such as models citing nonexistent theorems and providing overly concise answers.
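The summary does not spell out how the best-of-32 selection was implemented, but the general pattern is straightforward: sample many candidate solutions per problem, score each, and keep the top-scoring one. The sketch below illustrates that generic scheme in Python; the `best_of_n`, `generate`, and `score` names are hypothetical stand-ins, not the evaluation's actual interface, and the stub generator and scorer would be replaced by real LLM calls.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    problem: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 32,
) -> Tuple[str, float]:
    """Generate n candidate solutions, score each, and return the best one.

    Note that cost scales linearly with n: a best-of-32 run pays for 32
    generations (plus scoring) per problem.
    """
    candidates: List[str] = [generate(problem) for _ in range(n)]
    scored = [(cand, score(problem, cand)) for cand in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Stand-in generator and scorer for illustration only; a real setup
    # would query an LLM for both the candidate proofs and the judging.
    def fake_generate(problem: str) -> str:
        return f"Candidate proof #{random.randint(0, 9999)} for: {problem}"

    def fake_score(problem: str, candidate: str) -> float:
        return random.random()  # placeholder for a judge-assigned score

    best, best_score = best_of_n("IMO 2025 Problem 1", fake_generate, fake_score, n=32)
    print(f"Selected candidate (score {best_score:.2f}): {best}")
```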