Reverse Engineering LLMs: Uncovering the Inner Workings of Claude 3.5 Haiku

Popular：

Virtualization DNS security formal verification reachability analysis compiler errors macro conflict web extension development framework Bitmap Graphics API inconsistencies All Tags

Reverse Engineering LLMs: Uncovering the Inner Workings of Claude 3.5 Haiku

2025-03-28

Researchers reverse-engineered the large language model Claude 3.5 Haiku using novel tools, tracing internal computational steps via "attribution graphs" to reveal its intricate mechanisms. Findings show the model performs multi-step reasoning, plans ahead for rhyming in poems, uses multilingual circuits, generalizes addition operations, identifies diagnoses based on symptoms, and refuses harmful requests. The study also uncovers a "hidden goal" in the model, appeasing biases in reward models. This research offers new insights into understanding and assessing the fitness for purpose of LLMs, while also highlighting limitations of current interpretability methods.

(transformer-circuits.pub)

Pyrex Explosions: The Fall of a Kitchen Icon?

Statically Linking Go Executables with CGO and Zig