Goodfire Releases Interpretability Tools for Llama 3.3 70B

2024-12-23

Goodfire has trained sparse autoencoders (SAEs) on Llama 3.3 70B and released the interpreted model through an API. An interactive feature map lets users explore the model's latent space. The team demonstrates feature steering and introduces improvements intended to make SAE-based steering easier and more reliable. The post also acknowledges limitations: a tension between feature steering and classification, and possible degradation of factual recall at higher steering strengths. Planned work includes refining steering methods and developing safety evaluations so that interpretability efforts can scale responsibly.
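For readers unfamiliar with the idea, here is a toy sketch of what an SAE and feature steering look like in general. It is not Goodfire's API or method. The class TinySAE, the function steer, the dimensions, and the random untrained weights are all hypothetical and chosen only for illustration. The sketch assumes one common form of steering, in which a scaled copy of a feature's decoder direction is added to a model activation.

```python
import random


class TinySAE:
    """Toy sparse autoencoder with random (untrained) weights. Hypothetical."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # One encoder row and one decoder row (feature direction) per feature.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map an activation vector to non-negative feature activations (ReLU)."""
        feats = []
        for row, b in zip(self.W_enc, self.b_enc):
            pre = sum(w * xi for w, xi in zip(row, x)) + b
            feats.append(max(0.0, pre))
        return feats

    def decode(self, feats):
        """Reconstruct an activation as a weighted sum of decoder directions."""
        out = list(self.b_dec)
        for f, direction in zip(feats, self.W_dec):
            for j, d in enumerate(direction):
                out[j] += f * d
        return out


def steer(sae, x, feature_idx, strength):
    """Assumed steering rule: add strength * decoder direction to activation x."""
    direction = sae.W_dec[feature_idx]
    return [xi + strength * di for xi, di in zip(x, direction)]


if __name__ == "__main__":
    sae = TinySAE(d_model=4, n_features=6)
    activation = [0.5, -1.0, 0.25, 2.0]
    print("features:", sae.encode(activation))
    print("steered: ", steer(sae, activation, feature_idx=2, strength=3.0))
```

In this toy setup, a larger strength moves the activation further from its original value, which gives some intuition for why the summary notes that high steering strengths may affect other behavior such as factual recall.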

(www.goodfire.ai)

AI Interpretability Sparse Autoencoders
