Stars
Entropy Based Sampling and Parallel CoT Decoding
Internalizing steering vectors via fine-tuning
Code to enable layer-level steering in LLMs using sparse autoencoders
Training Sparse Autoencoders on Language Models
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
A library for efficient patching and automatic circuit discovery.
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
My interpretation of what einops indexing would look like (created to work on during my SERI MATS project).
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)*
Fork of Arthur Conmy's Automatic-Circuit-Discovery for the purpose of conducting ACDC research
Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"
Repo for hosting Streamlit pages for my 2023 SERI MATS project with Arthur Conmy (mentored by Neel Nanda).
Type annotations and runtime checking for shape and dtype of JAX/NumPy/PyTorch/etc. arrays. https://docs.kidger.site/jaxtyping/
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
Mechanistic Interpretability Visualizations using React
berkott / lucent
Forked from greentfrapp/lucent
Lucid library adapted for PyTorch with new features for ViTs and MLP-Mixers
A library for mechanistic interpretability of GPT-style language models
An autoregressive character-level language model for making more things