
A 4 Week Mechanistic Interpretability Beginner Course based on Neel Nanda's guide


nerdlab53/mech-interp-course


Mechanistic Interpretability 4 Week Course

Based on Neel Nanda's guide

Commitment: 12 hrs/week for 4 weeks

Structured as a workbook with concrete weekly tasks, code deliverables, and progress tracking.
Focus: Build intuition for transformers, use TransformerLens, and run experiments on GPT-2-small.

How to use: work through one week at a time, completing each week's tasks and deliverables before moving on.

Week 1: ML Prerequisites & PyTorch Fluency

Goal: Train an MLP on MNIST and understand the basics of the transformer architecture.
Time: 12 hours

Tasks

  1. PyTorch Basics (4 hrs)

    • Code an MLP for MNIST (input: 784 → hidden: 256 → output: 10).
    • Use torch.nn.Sequential, DataLoader, and CrossEntropyLoss.
    • Deliverable: Achieve >95% test accuracy.
  2. Transformer Architecture (6 hrs)

  3. Python Practice (2 hrs)

    • Rewrite data loading with torch.utils.data.Dataset and zip for batching.
    • Use list comprehensions for MNIST preprocessing.
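
The MLP task above can be sketched as follows. This is a minimal starting point, not a full solution: a synthetic batch stands in for MNIST (swap in `torchvision.datasets.MNIST` for the real deliverable), and hitting >95% test accuracy will still require a real training loop over multiple epochs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# 784 -> 256 -> 10 MLP, as specified in the task.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Synthetic stand-in for MNIST-shaped data (1x28x28 images, 10 classes).
images = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One pass over the synthetic data; repeat over real MNIST for several epochs.
for x, y in loader:
    logits = model(x)          # shape: (batch, 10)
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```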

Success Criteria


Week 2: TransformerLens & Mechanistic Intuition

Goal: Use TransformerLens to probe GPT-2-small and visualize activations.
Time: 12 hours

Tasks

  1. TransformerLens Setup (3 hrs)

    • Install and run Main Demo.
    • Extract MLP activations for the prompt “Hello, world!”.
  2. Induction Heads Tutorial (6 hrs)

  3. Python Practice (3 hrs)

    • Use einops to reshape GPT-2 activations (e.g., rearrange(activations, 'b s h -> h (b s)')).
    • Write a decorator to log tensor shapes during inference.
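
The Python-practice tasks above can be sketched as follows. The decorator logs tensor shapes as they flow through a function, and `flatten_heads` is a pure-torch equivalent of the `einops` pattern `'b s h -> h (b s)'` (use `einops.rearrange` directly once it is installed). The tensor here is random data standing in for real GPT-2 activations.

```python
import functools
import torch

def log_shapes(fn):
    """Decorator that prints the shape of tensor inputs and outputs."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        for a in args:
            if isinstance(a, torch.Tensor):
                print(f"{fn.__name__} input shape: {tuple(a.shape)}")
        out = fn(*args, **kwargs)
        if isinstance(out, torch.Tensor):
            print(f"{fn.__name__} output shape: {tuple(out.shape)}")
        return out
    return wrapper

@log_shapes
def flatten_heads(activations: torch.Tensor) -> torch.Tensor:
    # Equivalent to einops.rearrange(activations, 'b s h -> h (b s)').
    b, s, h = activations.shape
    return activations.permute(2, 0, 1).reshape(h, b * s)

acts = torch.randn(2, 5, 768)  # (batch, seq, hidden); 768 is GPT-2-small's d_model
flat = flatten_heads(acts)     # shape: (768, 10)
```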

Success Criteria


Week 3: Replicate a Paper & Debugging

Goal: Replicate a key result from Interpretability in the Wild (Wang et al., 2022).
Time: 12 hours

Tasks

  1. Paper Deep Dive (3 hrs)

  2. Code Replication (7 hrs)

    • Use TransformerLens to implement activation patching on GPT-2-small.
    • Deliverable: Reproduce Fig 3 (ablation effect on IOI task).
  3. Python Practice (2 hrs)

    • Write a generator for synthetic prompts (e.g., “John gave Mary a {object}”).
    • Use functools.partial to batch-process prompts.
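
The core idea of activation patching can be sketched on a toy model before attempting it on GPT-2-small with TransformerLens: cache an activation from a "clean" run, then re-run on a "corrupted" input while substituting the cached activation via a forward hook. The two-layer MLP here is a stand-in; the real experiment patches attention-head and MLP outputs inside GPT-2-small.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny stand-in model; the real task uses GPT-2-small via TransformerLens.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

clean = torch.randn(1, 4)
corrupted = torch.randn(1, 4)

# 1. Cache the activation of interest (layer-0 output) on the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()

handle = model[0].register_forward_hook(save_hook)
clean_logits = model(clean)
handle.remove()

# 2. Re-run on the corrupted input, patching in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["act"]  # returning a tensor replaces the module's output

handle = model[0].register_forward_hook(patch_hook)
patched_logits = model(corrupted)
handle.remove()

# Since the patch replaces the entire layer-0 output, everything downstream
# matches the clean run exactly -- a useful sanity check for patching code.
print(torch.allclose(patched_logits, clean_logits))  # True
```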

Success Criteria

  • Activation patching code for IOI task.
  • 1-page paper summary with techniques/limitations.
  • Join ML Collective Discord for feedback.
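
The Python-practice tasks for this week (a prompt generator plus `functools.partial` for batch processing) can be sketched like this; the template and names follow the "John gave Mary a {object}" example above.

```python
from functools import partial

def make_prompt(subject, recipient, obj):
    return f"{subject} gave {recipient} a {obj}"

def prompt_generator(objects, subject="John", recipient="Mary"):
    """Lazily yield one IOI-style prompt per object."""
    # partial fixes the names, leaving only the object to fill in per prompt.
    template = partial(make_prompt, subject, recipient)
    for obj in objects:
        yield template(obj)

prompts = list(prompt_generator(["book", "pen"]))
# ['John gave Mary a book', 'John gave Mary a pen']
```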

Week 4: Open Problems & Mini-Research

Goal: Tackle a problem from 200 Concrete Open Problems.
Time: 12 hours

Tasks

  1. Problem Selection (2 hrs)

    • Choose a problem tagged A (Easy) (e.g., “Does GPT-2-small use positional embeddings in MLP layers?”).
  2. Experimentation (8 hrs)

    • Use TransformerLens to extract positional embeddings and ablate MLPs.
    • Deliverable: Plot logit differences before/after ablation.
  3. Documentation (2 hrs)

    • Write a blog-style summary of findings (500 words).
    • Share in ML Collective Discord for feedback.
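
The ablation-and-logit-difference measurement can be prototyped on a toy residual-stream model before moving to GPT-2-small: run once normally, run once with the MLP's contribution zeroed out, and compare the logit difference between two candidate answer tokens. The layer sizes and token indices here are arbitrary placeholders.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy "transformer-ish" stack: embed -> residual stream + MLP -> unembed.
embed = nn.Linear(4, 8)
mlp = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
unembed = nn.Linear(8, 3)

x = torch.randn(1, 4)

def forward(ablate_mlp=False):
    resid = embed(x)
    if not ablate_mlp:
        resid = resid + mlp(resid)  # MLP writes into the residual stream
    return unembed(resid)           # ablation = zeroing that contribution

baseline = forward()
ablated = forward(ablate_mlp=True)

# Logit difference between two candidate answer tokens (indices 0 and 1 here),
# before and after ablation -- the quantity you would plot for the deliverable.
diff_before = (baseline[0, 0] - baseline[0, 1]).item()
diff_after = (ablated[0, 0] - ablated[0, 1]).item()
```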

Success Criteria


Setup Instructions

  1. Fork the Repository

    # Click the 'Fork' button in the top right of the GitHub repository page
  2. Clone Your Fork

    git clone https://github.com/YOUR_USERNAME/mechanistic-interpretability-course.git
    cd mechanistic-interpretability-course
  3. Create and Activate Virtual Environment

    # For Python venv
    python -m venv venv
    
    # On Windows
    .\venv\Scripts\activate
    
    # On Unix or MacOS
    source venv/bin/activate
  4. Install Dependencies

    pip install -r requirements.txt
  5. Create Directory Structure

    # Make the setup script executable
    chmod +x setup.sh
    
    # Run the setup script
    ./setup.sh

Repository Structure

mechanistic-interpretability-course/
├── week1/
│   ├── mnist_mlp/
│   ├── transformer_block/
│   └── python_practice/
├── week2/
│   ├── transformerlens_setup/
│   ├── induction_heads/
│   └── python_practice/
├── week3/
│   ├── paper_analysis/
│   ├── activation_patching/
│   └── python_practice/
├── week4/
│   ├── problem_selection/
│   ├── experiments/
│   └── blog_post/
├── requirements.txt
└── README.md

Working with the Repository

  1. Track Your Progress

    • Each week's folder contains a README.md file for tracking progress
    • Use the provided Notion template for detailed progress tracking
  2. Submitting Work

    • Create a new branch for each week's work:
      git checkout -b week1-solutions
    • Commit your changes regularly:
      git add .
      git commit -m "Completed MNIST MLP implementation"
    • Push to your fork:
      git push origin week1-solutions
  3. Getting Updates

    • Add the original repository as upstream:
      git remote add upstream https://github.com/ORIGINAL_OWNER/mechanistic-interpretability-course.git
    • Fetch and merge updates:
      git fetch upstream
      git merge upstream/main

Using Google Colab

  • Each notebooks directory can be synced with Google Colab
  • Use the "Open in Colab" button and save copies to your Google Drive
  • Remember to save your work back to the repository

Need Help?

Pro Tips

  1. Debugging: Use %debug in Colab for post-mortem inspection of shape errors.
  2. Compute: For free GPU, use Colab → Runtime → Change runtime type → T4 GPU.
  3. Tooling: Bookmark TransformerLens Docs.

By the end of this plan, you'll have gained hands-on experience with transformers, replicated a paper result, and made progress on an open problem. Adjust tasks as needed, but prioritize coding over passive reading!
