Goodfire Releases Interpretability Tools for Llama 3.3 70B

2024-12-23

Goodfire has trained sparse autoencoders (SAEs) on Llama 3.3 70B and released the interpreted model through an API. An interactive feature map lets users explore the model's latent space. The team demonstrates feature steering and introduces improvements intended to make SAE-based steering easier and more reliable. The team also acknowledges limitations, including a tension between feature steering and classification and a possible degradation of factual recall at higher steering strengths. Future work includes refining steering methodologies and developing safety evaluations to support responsible scaling of interpretability efforts.
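
To make the idea of SAE-based feature steering concrete, here is a minimal toy sketch in plain Python. It is not Goodfire's API or method, which this summary does not describe. It assumes a generic ReLU sparse autoencoder with one decoder direction per feature, and all names (`encode`, `decode`, `steer`, `target`) and weights are hypothetical. Steering is modeled as moving the activation along a feature's decoder direction until that feature would decode at a chosen strength, leaving the SAE's reconstruction error untouched.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row per SAE feature


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """ReLU encoder: map a model activation to sparse feature activations."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Linear decoder: sum of feature activations times decoder directions."""
    out = list(b_dec)
    for act, direction in zip(f, w_dec):
        for i, d in enumerate(direction):
            out[i] += act * d
    return out


def steer(x: Vector, w_enc: Matrix, b_enc: Vector,
          w_dec: Matrix, feature: int, target: float) -> Vector:
    """Shift x along one feature's decoder direction (hypothetical helper).

    Equivalent to setting that feature to `target`, decoding, and adding
    back the original reconstruction error x - decode(encode(x)).
    """
    current = encode(x, w_enc, b_enc)[feature]
    delta = target - current
    return [xi + delta * d for xi, d in zip(x, w_dec[feature])]


# Toy weights: 2-dimensional activations, 3 features (made up for illustration).
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
features = encode(x, w_enc, b_enc)
steered = steer(x, w_enc, b_enc, w_dec, feature=1, target=3.0)
```

In this sketch, a larger `target` moves the activation further from where the model originally put it. That is one intuition for why side effects such as degraded factual recall could appear at higher steering strengths, though the toy model here does not demonstrate that effect.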