LLM Performance on Advent of Code 2024: A Surprise

2024-12-30

This post describes an experiment testing several leading Large Language Models (LLMs) on the 2024 Advent of Code challenge. The models performed worse than expected, even underperforming the author. A simple framework provided each model with the complete problem description and required an executable Python program in return. The results showed frequent timeouts and exceptions, suggesting that LLMs excel at familiar problems but struggle with novel ones. Possible explanations include reliance on memorized program templates, insufficient computational resources, or suboptimal prompting. The experiment also highlights Advent of Code as a potential benchmark for evaluating coding agents.
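
The post does not show the harness itself, but the setup it describes (hand the model the full puzzle text, require a runnable Python program, execute it, and record timeouts and exceptions) could look roughly like the sketch below. The `query_llm` callable and the 60-second limit are assumptions for illustration, not details from the experiment.

```python
import subprocess
import sys
import tempfile

TIMEOUT_SECONDS = 60  # assumed per-puzzle time limit, not from the post


def solve_with_llm(puzzle_text: str, query_llm) -> str:
    """Ask the model for a complete Python solution and run it.

    `query_llm` is a placeholder for whatever client call returns the
    model's raw text response; it is not part of the original post.
    """
    prompt = (
        "Solve the following Advent of Code puzzle. "
        "Reply with a complete, self-contained Python program that "
        "prints the answer.\n\n" + puzzle_text
    )
    code = query_llm(prompt)

    # Write the generated program to a temp file and execute it,
    # treating timeouts and runtime exceptions as failures.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name

    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=TIMEOUT_SECONDS,
        )
    except subprocess.TimeoutExpired:
        return "FAIL: timeout"
    if result.returncode != 0:
        return "FAIL: exception\n" + result.stderr
    return result.stdout.strip()
```

A loop over the 2024 puzzles would then tally how many runs return an answer versus how many end in the timeout or exception branches, which is the failure pattern the post reports.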