Based on Neel Nanda's guide
Commitment: 12 hrs/week for 4 weeks
Structured as a workbook with concrete weekly tasks, code deliverables, and progress tracking.
Focus: Build intuition for transformers, use TransformerLens, and run experiments on GPT-2-small.
## Week 1

Goal: Train an MLP on MNIST and understand transformer architecture basics.
Time: 12 hours
- PyTorch Basics (4 hrs)
  - Code an MLP for MNIST (input: 784 → hidden: 256 → output: 10).
  - Use `torch.nn.Sequential`, `DataLoader`, and `CrossEntropyLoss`.
  - Deliverable: Achieve >95% test accuracy.
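One way the MLP spec above might look in code. This is a minimal sketch: the layer sizes come from the bullet, but the optimizer, the training loop, and the real MNIST `DataLoader` are left out, and the input here is random fake data.

```python
import torch
import torch.nn as nn

# Minimal MLP matching the 784 -> 256 -> 10 spec above.
model = nn.Sequential(
    nn.Flatten(),        # (batch, 1, 28, 28) -> (batch, 784)
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),  # logits for the 10 digit classes
)

loss_fn = nn.CrossEntropyLoss()

# One illustrative forward/backward step on fake data; real code would
# loop over a DataLoader and call an optimizer step after backward().
x = torch.randn(32, 1, 28, 28)
y = torch.randint(0, 10, (32,))
logits = model(x)
loss = loss_fn(logits, y)
loss.backward()
```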
- Transformer Architecture (6 hrs)
  - Study Callum McDougall’s Transformer from Scratch or Neel Nanda’s Transformer from Scratch.
  - Code: Implement a single transformer block (attention + MLP) from scratch.
  - Deliverable: Pass the tutorial’s test cases (no copying!).
- Python Practice (2 hrs)
  - Rewrite data loading with `torch.utils.data.Dataset` and `zip` for batching.
  - Use list comprehensions for MNIST preprocessing.
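A sketch of what that practice exercise could look like, using a toy stand-in for MNIST (the class and function names here are illustrative, not from any library):

```python
import torch
from torch.utils.data import Dataset

class DigitDataset(Dataset):
    """Toy stand-in for MNIST: pairs of (flattened image, label)."""

    def __init__(self, images, labels):
        # List-comprehension "preprocessing": flatten and scale each image.
        self.images = [img.flatten() / 255.0 for img in images]
        self.labels = list(labels)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

def batches(dataset, batch_size):
    """Hand-rolled batching with zip, instead of a DataLoader."""
    for start in range(0, len(dataset), batch_size):
        items = [dataset[i] for i in range(start, min(start + batch_size, len(dataset)))]
        xs, ys = zip(*items)  # zip(*...) transposes [(x, y), ...] into (xs, ys)
        yield torch.stack(xs), torch.tensor(ys)

ds = DigitDataset([torch.ones(28, 28) * i for i in range(10)], range(10))
xb, yb = next(batches(ds, 4))
```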
Deliverables:
- MLP trained on MNIST.
- Single transformer block code passing tests.

Extra: read the Barebones Guide to MI Prerequisites.
## Week 2

Goal: Use TransformerLens to probe GPT-2-small and visualize activations.
Time: 12 hours
- TransformerLens Setup (3 hrs)
  - Install TransformerLens and run the Main Demo.
  - Extract MLP activations for the prompt “Hello, world!”.
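In TransformerLens you get cached activations via `model.run_with_cache(prompt)`; under the hood this is the standard PyTorch forward-hook mechanism. Here is that underlying mechanism on a toy model, so the idea is clear before you touch GPT-2 (the model and hook names here are made up for illustration):

```python
import torch
import torch.nn as nn

# Toy model; pretend the middle module is an MLP layer we want to probe.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

activations = {}

def save_activation(name):
    """Return a hook that stashes whatever flows out of a module."""
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register the hook on the ReLU, then run a forward pass.
model[1].register_forward_hook(save_activation("relu_out"))
_ = model(torch.randn(3, 8))
```

After the forward pass, `activations["relu_out"]` holds the cached tensor, exactly as a TransformerLens `ActivationCache` would for a named hook point.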
- Induction Heads Tutorial (6 hrs)
  - Complete the Induction Heads exercises.
  - Deliverable: Plot attention patterns for induction heads.
- Python Practice (3 hrs)
  - Use `einops` to reshape GPT-2 activations (e.g., `rearrange(activations, 'b s h -> h (b s)')`).
  - Write a decorator to log tensor shapes during inference.
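A sketch of the shape-logging decorator exercise. The function it wraps also shows the `'b s h -> h (b s)'` reshape from the `einops` bullet, written in plain PyTorch for comparison (all names here are illustrative):

```python
import functools
import torch

def log_shapes(fn):
    """Decorator: print the shape of every tensor argument and of the result."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        for a in args:
            if isinstance(a, torch.Tensor):
                print(f"{fn.__name__} input: {tuple(a.shape)}")
        out = fn(*args, **kwargs)
        if isinstance(out, torch.Tensor):
            print(f"{fn.__name__} output: {tuple(out.shape)}")
        return out
    return wrapper

@log_shapes
def pool_heads(acts: torch.Tensor) -> torch.Tensor:
    # Same reshape as einops' rearrange(acts, 'b s h -> h (b s)'):
    b, s, h = acts.shape
    return acts.permute(2, 0, 1).reshape(h, b * s)

flat = pool_heads(torch.randn(2, 5, 12))
```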
Deliverables:
- Induction heads attention patterns plotted.
- GPT-2 activations extracted for 10 prompts.

Extra: watch the Mathematical Framework Walkthrough.
## Week 3

Goal: Replicate a key result from Interpretability in the Wild (Wang et al.).
Time: 12 hours
- Paper Deep Dive (3 hrs)
  - Read Interpretability in the Wild (Sections 1–3).
  - Summarize their methodology for activation patching.
- Code Replication (7 hrs)
  - Use TransformerLens to implement activation patching on GPT-2-small.
  - Deliverable: Reproduce Fig. 3 (ablation effect on the IOI task).
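Activation patching means: run a corrupted input, but overwrite one activation with the value it had on a clean input, and see how much of the clean behavior comes back. Before doing this in TransformerLens, the mechanic can be sketched on a toy model with plain PyTorch hooks (a forward hook that returns a value overrides the module's output):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

clean_x = torch.randn(1, 4)
corrupt_x = torch.randn(1, 4)

# 1. Run the clean input and cache the activation of interest.
cache = {}
h = model[1].register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
clean_out = model(clean_x)
h.remove()

# 2. Run the corrupted input, but patch in the clean activation.
def patch(module, inputs, output):
    return cache["act"]  # returned value replaces the module's output

h = model[1].register_forward_hook(patch)
patched_out = model(corrupt_x)
h.remove()

# 3. Unpatched corrupted run, for comparison.
corrupt_out = model(corrupt_x)
```

Here everything downstream of the patched module is restored to the clean run, so `patched_out` matches `clean_out` exactly; in GPT-2 you would instead measure how much of the clean logit difference is recovered.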
- Python Practice (2 hrs)
  - Write a generator for synthetic prompts (e.g., “John gave Mary a {object}”).
  - Use `functools.partial` to batch-process prompts.
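A sketch of both practice exercises together. The function names are illustrative, and `len` stands in for a real model call so the example is self-contained:

```python
from functools import partial

def prompt_gen(template, objects):
    """Yield one filled-in prompt per object."""
    for obj in objects:
        yield template.format(object=obj)

def run_batch(prompts, model_fn, batch_size):
    """Apply model_fn to prompts in chunks of batch_size."""
    prompts = list(prompts)
    return [model_fn(prompts[i:i + batch_size])
            for i in range(0, len(prompts), batch_size)]

# partial fixes model_fn and batch_size, so callers only supply prompts.
count_batch = partial(run_batch, model_fn=len, batch_size=2)

prompts = prompt_gen("John gave Mary a {object}", ["book", "pen", "ball"])
sizes = count_batch(prompts)  # batch sizes: [2, 1]
```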
Deliverables:
- Activation patching code for the IOI task.
- 1-page paper summary with techniques/limitations.

Extra: join the ML Collective Discord for feedback.
## Week 4

Goal: Tackle a problem from 200 Concrete Open Problems.
Time: 12 hours
- Problem Selection (2 hrs)
  - Choose a problem tagged A (Easy) (e.g., “Does GPT-2-small use positional embeddings in MLP layers?”).
- Experimentation (8 hrs)
  - Use TransformerLens to extract positional embeddings and ablate MLPs.
  - Deliverable: Plot logit differences before/after ablation.
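Zero-ablation is the simplest ablation: hook a component and replace its output with zeros, then compare logits. A toy-model sketch of that measurement (the model here is a made-up residual stack, not GPT-2):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in "language model": two residual blocks, vocabulary of 5 tokens.
blocks = nn.ModuleList([nn.Linear(16, 16), nn.Linear(16, 16)])
unembed = nn.Linear(16, 5)

def forward(x):
    for b in blocks:
        x = x + b(x)  # residual connections, as in a transformer
    return unembed(x)

x = torch.randn(1, 16)
baseline_logits = forward(x)

# Zero-ablate block 1: a forward hook returning a value overrides the output.
hook = blocks[1].register_forward_hook(lambda m, i, o: torch.zeros_like(o))
ablated_logits = forward(x)
hook.remove()

# How much each logit moved under ablation.
logit_diff = (baseline_logits - ablated_logits).abs()
```

For the actual open problem you would ablate MLP layers in GPT-2-small via TransformerLens hooks and plot `logit_diff` across layers or positions.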
- Documentation (2 hrs)
  - Write a blog-style summary of findings (500 words).
  - Share it in the ML Collective Discord for feedback.
Deliverables:
- Ablation experiment code.
- Blog post draft.

Extra: skim the Concrete Open Problems appendix.
- Fork the repository: click the “Fork” button in the top right of the GitHub repository page.
- Clone your fork:

  ```bash
  git clone https://github.com/YOUR_USERNAME/mechanistic-interpretability-course.git
  cd mechanistic-interpretability-course
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  # On Windows
  .\venv\Scripts\activate
  # On Unix or macOS
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create the directory structure:

  ```bash
  # Make the setup script executable, then run it
  chmod +x setup.sh
  ./setup.sh
  ```
```text
mechanistic-interpretability-course/
├── week1/
│   ├── mnist_mlp/
│   ├── transformer_block/
│   └── python_practice/
├── week2/
│   ├── transformerlens_setup/
│   ├── induction_heads/
│   └── python_practice/
├── week3/
│   ├── paper_analysis/
│   ├── activation_patching/
│   └── python_practice/
├── week4/
│   ├── problem_selection/
│   ├── experiments/
│   └── blog_post/
├── requirements.txt
└── README.md
```
- Track Your Progress
  - Each week's folder contains a README.md file for tracking progress.
  - Use the provided Notion template for detailed progress tracking.
- Submitting Work
  - Create a new branch for each week's work:

    ```bash
    git checkout -b week1-solutions
    ```

  - Commit your changes regularly:

    ```bash
    git add .
    git commit -m "Completed MNIST MLP implementation"
    ```

  - Push to your fork:

    ```bash
    git push origin week1-solutions
    ```
- Getting Updates
  - Add the original repository as upstream:

    ```bash
    git remote add upstream https://github.com/ORIGINAL_OWNER/mechanistic-interpretability-course.git
    ```

  - Fetch and merge updates:

    ```bash
    git fetch upstream
    git merge upstream/main
    ```
- Each `notebooks` directory can be synced with Google Colab.
- Use the “Open in Colab” button and save copies to your Google Drive.
- Remember to save your work back to the repository.
- Join the ML Collective Discord
- Open an issue in the repository
- Check the TransformerLens Documentation
- Debugging: Use `%debug` in Colab for post-mortem inspection of shape errors.
- Compute: For a free GPU, use Colab → Runtime → Change runtime type → T4 GPU.
- Tooling: Bookmark the TransformerLens Docs.
By the end of this plan, you’ll have hands-on experience with transformers, replicated a paper, and contributed to open problems. Adjust tasks as needed, but prioritize coding over passive reading!