Tracing Circuits: Uncovering Computational Graphs in LLMs

Researchers introduce a method for interpreting the inner workings of deep learning models using cross-layer transcoders (CLTs). CLTs decompose model activations into sparse, interpretable features, and from those features the method constructs causal attribution graphs of feature interactions that trace how the model produces its outputs. The approach successfully explains model responses to a range of prompts (e.g., acronym generation, factual recall, and simple addition) and is validated through perturbation experiments. Although it has limitations, such as an inability to fully explain attention mechanisms, it offers a valuable tool for peering inside large language models.
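To make the decomposition concrete, the sketch below shows one plausible shape for a cross-layer transcoder: a single encoder reads the residual stream at one layer, and per-layer decoders reconstruct the MLP outputs of that layer and the layers after it from a shared sparse feature vector. All names, dimensions, and the ReLU-plus-L1 sparsity scheme here are illustrative assumptions, not the authors' exact architecture or training recipe.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Minimal sketch of a cross-layer transcoder (CLT).

    An encoder reads the residual stream at one layer and produces a
    sparse, non-negative feature vector; per-layer decoders reconstruct
    the MLP outputs of that layer and all later layers from those
    features. Dimensions and the sparsity mechanism are assumptions
    made for illustration.
    """

    def __init__(self, d_model: int, n_features: int, n_layers_out: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        # One decoder per downstream layer whose MLP output we reconstruct.
        self.decoders = nn.ModuleList(
            nn.Linear(n_features, d_model, bias=False)
            for _ in range(n_layers_out)
        )

    def forward(self, resid: torch.Tensor):
        # Sparse feature activations (ReLU keeps them non-negative).
        feats = torch.relu(self.encoder(resid))
        # Reconstruction of each downstream layer's MLP output.
        recons = [dec(feats) for dec in self.decoders]
        return feats, recons


# Toy usage: fit against cached MLP outputs with an L1 sparsity penalty.
if __name__ == "__main__":
    d_model, n_features, n_layers_out = 64, 512, 4
    clt = CrossLayerTranscoder(d_model, n_features, n_layers_out)

    resid = torch.randn(8, d_model)  # residual stream at some layer l
    # Stand-ins for cached MLP outputs at layers l, l+1, ..., l+3.
    targets = [torch.randn(8, d_model) for _ in range(n_layers_out)]

    feats, recons = clt(resid)
    mse = sum(((r - t) ** 2).mean() for r, t in zip(recons, targets))
    loss = mse + 1e-3 * feats.abs().mean()  # reconstruction + sparsity
    loss.backward()
    print(f"fraction of active features: {(feats > 0).float().mean():.3f}")
```

The cross-layer decoders are the point of this design: because a single feature can write to several downstream layers, the attribution graphs built on top of these features can skip intermediate layers and, plausibly, stay sparser than graphs built from ordinary per-layer dictionaries.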