LLMs Fail at Set, Reasoning Models Triumph

2025-02-19

An experiment tested the reasoning capabilities of Large Language Models (LLMs) on the card game Set. In Set, players must find a trio of cards from a twelve-card layout such that, for each of four attributes (number, shape, color, shading), the three cards are either all alike or all different. General-purpose LLMs such as GPT-4o, Sonnet-3.5, and Mistral failed to consistently identify valid sets, often proposing invalid combinations or claiming no set existed. Newer reasoning models, DeepThink-R1 and o3-mini, solved the problem reliably, demonstrating stronger logical reasoning. The result highlights a limitation of LLMs on complex logical tasks despite their strength in natural language processing, and a clear advantage for specialized reasoning models.
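For reference, the task itself is easy to state programmatically. The sketch below (not the experiment's actual harness; the card encoding is an assumption for illustration) checks the "all alike or all different" rule and brute-forces every triple in a twelve-card layout:

```python
from itertools import combinations, product

# A card is a 4-tuple of attribute values (number, shape, color, shading);
# values 0-2 stand in for the real options (e.g. 1/2/3, oval/diamond/squiggle, ...).

def is_set(a, b, c):
    """Three cards form a Set iff every attribute is all-same
    or all-different across the three cards (never exactly two alike)."""
    return all(len({x, y, z}) != 2 for x, y, z in zip(a, b, c))

def find_sets(layout):
    """Brute-force all C(12, 3) = 220 triples in a 12-card layout."""
    return [t for t in combinations(layout, 3) if is_set(*t)]

# Example: take the first 12 cards of the full 81-card deck.
deck = list(product(range(3), repeat=4))
layout = deck[:12]
sets_found = find_sets(layout)
```

A 220-triple exhaustive check is trivial for code, which is part of what makes the LLMs' failures on this task notable: the difficulty lies in applying the rule consistently, not in search space size.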