Evals Are Not Enough: The Limitations of LLM Evaluation

2025-03-03

This article critiques the prevalent practice of relying on evaluations (evals) to guarantee the performance of Large Language Model (LLM) software. While acknowledging the role of evals in comparing base models and in unit testing, the author highlights several critical flaws in their real-world application: the difficulty of creating comprehensive test datasets; the limitations of automated scoring methods; the inadequacy of evaluating only the base model rather than the system as a whole; and the way averaging evaluation results masks severe errors. The author argues that evals fail to address the inherent "long tail problem" of LLMs, where unexpected situations always arise in production. Ultimately, the article calls for a change in LLM development practices: a shift away from relying solely on evals and towards prioritizing user testing and more comprehensive system testing.
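
To make the "averaging masks severe errors" point concrete, here is a minimal, hypothetical sketch (not from the article): a single catastrophic output barely moves the mean eval score, while worst-case and failure-rate metrics surface it immediately. The scores and the 0.5 failure threshold are illustrative assumptions.

```python
# Hypothetical per-example eval scores; one output is a catastrophic failure (0.0).
scores = [0.95, 0.92, 0.97, 0.90, 0.0]

mean_score = sum(scores) / len(scores)
worst_score = min(scores)
failure_rate = sum(s < 0.5 for s in scores) / len(scores)  # assumed threshold

print(f"mean:         {mean_score:.2f}")    # 0.75 -- looks acceptable
print(f"worst case:   {worst_score:.2f}")   # 0.00 -- the severe error the mean hides
print(f"failure rate: {failure_rate:.0%}")  # 20% -- visible only when reported separately
```

Reporting worst-case scores and failure rates alongside the mean is one way to keep a headline eval number from papering over exactly the kind of production-breaking error the article is concerned with.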