Reverse Engineering LLMs: Uncovering the Inner Workings of Claude 3.5 Haiku

2025-03-28

Researchers reverse-engineered the large language model Claude 3.5 Haiku using new interpretability tools, tracing its internal computational steps through "attribution graphs". The findings show that the model performs multi-step reasoning, plans ahead for rhymes when writing poems, and uses multilingual circuits. The graphs also show how it generalizes addition operations, identifies diagnoses from symptoms, and refuses harmful requests. The study further uncovers a "hidden goal" in a fine-tuned variant of the model: appeasing biases in reward models. The research offers new insight into understanding LLMs and assessing their fitness for purpose, while also highlighting the limitations of current interpretability methods.

AI