LLMs Fall Short at IMO 2025: Medal-Level Performance Remains Elusive
Researchers evaluated five state-of-the-art large language models (LLMs) on the 2025 International Mathematical Olympiad (IMO) problems using the MathArena platform. Gemini 2.5 Pro performed best, scoring only 31% (13 of 42 points), well below the 19 points needed for a bronze medal, and the other models lagged far behind. A best-of-32 selection strategy, in which multiple candidate responses were generated per problem and the strongest was selected, substantially increased computational cost. Even so, the results demonstrate a wide gap between current LLMs and medal-level performance on problems as challenging as the IMO's. Qualitative analysis revealed issues such as models citing nonexistent theorems and providing overly concise answers.
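The summary does not spell out how the best-of-32 selection was implemented, but the general pattern is straightforward: sample many candidate solutions per problem, score each, and keep the top-scoring one. The sketch below illustrates that generic scheme in Python; the `best_of_n`, `generate`, and `score` names are hypothetical stand-ins, not the evaluation's actual interface, and the stub generator and scorer would be replaced by real LLM calls.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    problem: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 32,
) -> Tuple[str, float]:
    """Generate n candidate solutions, score each, and return the best one.

    Note that cost scales linearly with n: a best-of-32 run pays for 32
    generations (plus scoring) per problem.
    """
    candidates: List[str] = [generate(problem) for _ in range(n)]
    scored = [(cand, score(problem, cand)) for cand in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Stand-in generator and scorer for illustration only; a real setup
    # would query an LLM for both the candidate proofs and the judging.
    def fake_generate(problem: str) -> str:
        return f"Candidate proof #{random.randint(0, 9999)} for: {problem}"

    def fake_score(problem: str, candidate: str) -> float:
        return random.random()  # placeholder for a judge-assigned score

    best, best_score = best_of_n("IMO 2025 Problem 1", fake_generate, fake_score, n=32)
    print(f"Selected candidate (score {best_score:.2f}): {best}")
```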