Evaluating LLMs in Text Adventures: A Novel Approach

2025-08-12

This article proposes a method for evaluating the capabilities of large language models (LLMs) in text adventure games. The LLM is given a set of in-game achievement goals and a fixed turn limit, and its final score is the number of achievements it completes within that limit. Because text adventures offer a high degree of freedom and heavy branching, even powerful LLMs cannot explore every branch before the turns run out, so the score is not an absolute measure of gaming skill; it is intended as a relative comparison between different LLMs.
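
To make the protocol concrete, here is a minimal sketch of one evaluation episode. It assumes hypothetical interfaces not defined in this article: an `llm` object with a `next_command` method, a `game` engine with `reset`, `step`, and `state` methods, and an `achievements` dict mapping goal names to predicate functions over the game state. The loop simply counts how many target achievements are unlocked before the turn limit.

    from dataclasses import dataclass, field

    @dataclass
    class EpisodeResult:
        achievements_unlocked: set = field(default_factory=set)
        turns_used: int = 0

    def evaluate_llm(llm, game, achievements, max_turns=100):
        """Play one episode: the LLM issues commands until the turn limit,
        and the score is the number of target achievements unlocked."""
        result = EpisodeResult()
        observation = game.reset()                    # opening room description
        for turn in range(max_turns):
            command = llm.next_command(observation)   # e.g. "open mailbox"
            observation = game.step(command)          # engine's textual response
            result.turns_used = turn + 1
            # record any achievement goals satisfied by the current game state
            result.achievements_unlocked |= {
                name for name, check in achievements.items()
                if check(game.state())
            }
            if len(result.achievements_unlocked) == len(achievements):
                break                                 # all goals met early
        return result

    # Relative score used to compare models under the same turn budget:
    # score = len(result.achievements_unlocked) / len(achievements)

Because the score is only meaningful relative to other models run under the same achievement set and turn budget, the same `achievements` and `max_turns` values would be held fixed across all LLMs being compared.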