Mechanistic Interpretability (SAE)¶

~3 min read

Prerequisite: Efficient Transformers

Source: BinaryVerse AI "Mechanistic Interpretability: A Journey Through Neural Networks" (2025), Neuronpedia, Gemma Scope

Concept¶

The problem it addresses:

- Neural networks are "black boxes": we know WHAT they do, but not WHY.
- Neurons are polysemantic (one neuron responds to several concepts).
- Superposition: the network represents more features than it has neurons.

Mechanistic Interpretability:

- Understanding HOW the model makes predictions
- Circuit tracing: which paths activate
- Feature discovery: what each neuron/feature represents

Key terms¶

Term	Definition
Activations	Neuron values as an input passes through the network
Features	Concepts/patterns the model represents (not the same thing as neurons)
Polysemanticity	One neuron represents several unrelated features
Superposition	More features than neurons, encoded simultaneously
Circuit	Path through the network from input features to output features
Patching	Intervention technique: replace activations to test causality
SAE	Sparse Autoencoder, an unsupervised method for feature discovery
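
Superposition can be illustrated with a toy example. The sketch below is not from the source: it assumes each feature is a random unit direction in activation space (a common simplification), packs 2048 such features into 512 "neurons", and checks that a sparse set of active features can still be read back. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 512, 2048  # 4x more features than neurons

# Assumption: each feature is a random unit vector in activation space.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference between two features = cosine similarity of their directions.
cosines = directions @ directions.T
np.fill_diagonal(cosines, 0.0)
max_interference = float(np.abs(cosines).max())

# Activate three features at once: the activation vector is their sum.
active = [3, 700, 1500]
activation = directions[active].sum(axis=0)

# Read every feature back with a dot product and keep the three strongest.
readout = directions @ activation
recovered = sorted(np.argsort(readout)[-3:].tolist())

print(f"max interference: {max_interference:.2f}")
print("recovered features:", recovered)
```

Because the directions are only nearly orthogonal, every readout carries some interference from the other active features. This stays tolerable only while few features are active at once, which is the sparsity that SAEs rely on.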

Activation Patching¶

Method:

1. Forward pass with input A -> activations_A
2. Forward pass with input B -> activations_B
3. Patch: replace activations_B[layer] with activations_A[layer]
4. Observe: did the output change?
5. Conclusion: if it did, the layer is critical for the A/B difference

Python Implementation:

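import torch


def activation_patch(model, input_a, input_b, layer_idx):
    """Run input_b with layer `layer_idx` overwritten by input_a's activations.

    Assumes model(x, return_intermediate=True) returns (output, per-layer
    activations), that model.layers is indexable, and that both inputs
    have the same shape.
    """
    # Cache the activations from input A (B's are not needed for the patch)
    with torch.no_grad():
        _, acts_a = model(input_a, return_intermediate=True)

    # Returning a value from a forward hook replaces that module's output
    def patch_hook(module, inputs, output):
        return acts_a[layer_idx]

    # Register the hook, run B, and always remove the hook afterwards
    handle = model.layers[layer_idx].register_forward_hook(patch_hook)
    try:
        with torch.no_grad():
            patched_output, _ = model(input_b, return_intermediate=True)
    finally:
        handle.remove()

    return patched_output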
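
The function above depends on a model that exposes `return_intermediate` and `model.layers`. For a version that runs as-is, here is a minimal sketch on a toy `nn.Sequential` network. The toy model and the helper names `run_and_capture` and `run_patched` are assumptions made for illustration; only `register_forward_hook` is standard PyTorch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real network: three linear layers with ReLU in between.
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 2),
)
model.eval()


def run_and_capture(model, x, module):
    """Run the model on x and return (output, activation of `module`)."""
    captured = {}

    def save_hook(mod, inputs, output):
        captured["act"] = output.detach().clone()

    handle = module.register_forward_hook(save_hook)
    try:
        with torch.no_grad():
            out = model(x)
    finally:
        handle.remove()
    return out, captured["act"]


def run_patched(model, x, module, replacement):
    """Run the model on x with `module`'s output replaced by `replacement`."""

    def patch_hook(mod, inputs, output):
        return replacement  # a non-None return value overrides the output

    handle = module.register_forward_hook(patch_hook)
    try:
        with torch.no_grad():
            return model(x)
    finally:
        handle.remove()


input_a = torch.randn(1, 4)
input_b = torch.randn(1, 4)
layer = model[2]  # the middle linear layer

out_a, act_a = run_and_capture(model, input_a, layer)
out_b, _ = run_and_capture(model, input_b, layer)
patched = run_patched(model, input_b, layer, act_a)

print("output on A:      ", out_a)
print("output on B:      ", out_b)
print("B patched with A: ", patched)
```

In a purely sequential model everything downstream depends only on the patched layer, so the patched run reproduces A's output exactly. In a transformer with a residual stream, patching a single component usually moves the output only part of the way, and that partial shift is the quantity of interest.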

ACDC (Automated Circuit Discovery)¶

Algorithm:

1. Start with full computation graph
2. For each edge:
   a. Patch edge -> measure effect on output
   b. If effect < threshold -> remove edge
3. Result: minimal circuit necessary for behavior

Importance criterion: $$ \text{Importance} = |\text{Output}_{\text{clean}} - \text{Output}_{\text{patched}}| $$
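
The pruning loop can be sketched on a toy graph. This is a deliberately simplified illustration, not the published ACDC implementation: it assumes a single linear node whose edges each contribute weight * input, treats "patching" an edge as feeding it a corrupted input, and scores the effect as the absolute difference between the clean and patched outputs. The function name and the data are hypothetical.

```python
def prune_edges(weights, clean, corrupted, threshold):
    """Return the indices of edges that survive ACDC-style pruning."""

    def output(removed):
        # Removed edges receive the corrupted input, the rest the clean one.
        return sum(
            w * (corrupted[i] if i in removed else clean[i])
            for i, w in enumerate(weights)
        )

    clean_output = output(set())
    removed = set()
    for edge in range(len(weights)):
        candidate = removed | {edge}
        effect = abs(clean_output - output(candidate))
        if effect < threshold:
            removed = candidate  # edge is not needed for this behavior
    return [i for i in range(len(weights)) if i not in removed]


weights = [2.0, 0.01, -1.5, 0.02]
clean = [1.0, 1.0, 1.0, 1.0]
corrupted = [0.0, 0.0, 0.0, 0.0]

kept = prune_edges(weights, clean, corrupted, threshold=0.1)
print(kept)  # [0, 2]
```

Edges 1 and 3 barely change the output when patched, so they are dropped, and the remaining edges form the "circuit". The threshold controls the trade-off: a higher value gives a smaller circuit that reproduces the behavior less faithfully.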

Sparse Autoencoders (SAE)¶

Why sparse:

- Neural activations are sparse by nature (10-20% active)
- Standard PCA/AE gives a dense reconstruction (not interpretable)
- Sparse coding -> interpretable features

Architecture: $$ \text{Encoded: } \mathbf{z} = \text{ReLU}(\mathbf{W}_e \mathbf{x} + \mathbf{b}_e) $$

$$ \text{Decoded: } \hat{\mathbf{x}} = \mathbf{W}_d \mathbf{z} + \mathbf{b}_d $$

Loss Function: $$ \mathcal{L} = \underbrace{\|\mathbf{x} - \hat{\mathbf{x}}\|^2}_{\text{Reconstruction}} + \underbrace{\lambda \|\mathbf{z}\|_1}_{\text{Sparsity}} $$
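
As a quick worked example with made-up numbers (not from the source), take $\mathbf{x} = (1, 2)$, $\hat{\mathbf{x}} = (0.5, 2)$, $\mathbf{z} = (0.3, 0, 1.2)$ and $\lambda = 0.01$:

$$ \mathcal{L} = \underbrace{(1 - 0.5)^2 + (2 - 2)^2}_{0.25} + \underbrace{0.01 \cdot (0.3 + 0 + 1.2)}_{0.015} = 0.265 $$

Raising $\lambda$ gives the sparsity term more weight, which pushes more entries of $\mathbf{z}$ to zero at the cost of reconstruction accuracy.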

Python SAE for an LLM:

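import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, sparsity_lambda=0.01):
        super().__init__()
        # hidden_dim is usually larger than input_dim (overcomplete dictionary)
        self.encoder = nn.Linear(input_dim, hidden_dim)  # W_e, b_e
        self.decoder = nn.Linear(hidden_dim, input_dim)  # W_d, b_d
        self.sparsity_lambda = sparsity_lambda

    def forward(self, x):
        # ReLU keeps the code non-negative; the L1 term in loss() makes it sparse
        z = F.relu(self.encoder(x))
        x_recon = self.decoder(z)
        return x_recon, z

    def loss(self, x, x_recon, z):
        # Reconstruction term: mean squared error over all elements
        recon_loss = F.mse_loss(x_recon, x)
        # L1 sparsity term, averaged over elements rather than summed
        sparsity_loss = self.sparsity_lambda * torch.mean(torch.abs(z))
        return recon_loss + sparsity_loss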
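
A minimal training sketch for this class follows. The synthetic data, the sizes, and the hyperparameters are all assumptions chosen so the example runs in seconds. In practice the inputs would be activations cached from a real model, and the class is repeated here only so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class SparseAutoencoder(nn.Module):
    """Same architecture as above, repeated so this block is self-contained."""

    def __init__(self, input_dim, hidden_dim, sparsity_lambda=0.01):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)
        self.sparsity_lambda = sparsity_lambda

    def forward(self, x):
        z = F.relu(self.encoder(x))
        return self.decoder(z), z

    def loss(self, x, x_recon, z):
        return F.mse_loss(x_recon, x) + self.sparsity_lambda * z.abs().mean()


# Synthetic "activations": each sample mixes a few of 64 hidden directions,
# i.e. more underlying features (64) than dimensions (32).
input_dim, n_true_features, n_samples = 32, 64, 4096
true_directions = torch.randn(n_true_features, input_dim)
mask = (torch.rand(n_samples, n_true_features) < 0.05).float()  # ~5% active
coefficients = mask * torch.rand(n_samples, n_true_features)
activations = coefficients @ true_directions

sae = SparseAutoencoder(input_dim, hidden_dim=input_dim * 8)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

losses = []
for step in range(300):
    batch = activations[torch.randint(0, n_samples, (256,))]
    x_recon, z = sae(batch)
    loss = sae.loss(batch, x_recon, z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Measure how sparse the learned code is across the whole dataset.
with torch.no_grad():
    _, z_all = sae(activations)
    active_fraction = (z_all > 0).float().mean().item()

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print(f"fraction of active latents: {active_fraction:.2%}")
```

The two printed numbers track the trade-off that `sparsity_lambda` controls: a larger value lowers the fraction of active latents but raises reconstruction error.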

Circuit Tracing & Attribution Graphs¶

Attribution Graph:

Input -> [Feature 1] -> [Feature 2] -> Output
         ^              ^
    [Feature 3] -> [Feature 4]

Nodes: Features (discovered by SAE)
Edges: Causal connections (discovered by patching)
Weights: Attribution strength
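
One way to hold such a graph in code is a weighted adjacency map. The sketch below follows one reading of the diagram above (Feature 3 feeds Feature 1 and Feature 4, and Feature 4 feeds Feature 2). The edge weights are made-up illustration values, and scoring a path by the product of its edge weights is an assumption that holds for linear attributions, not a rule stated in the source.

```python
from functools import lru_cache

# Edges follow the diagram; the weights are made-up illustration values.
edges = {
    "Input": {"Feature 1": 0.9},
    "Feature 3": {"Feature 1": 0.4, "Feature 4": 0.7},
    "Feature 1": {"Feature 2": 0.5},
    "Feature 4": {"Feature 2": 0.6},
    "Feature 2": {"Output": 0.8},
}


@lru_cache(maxsize=None)
def influence_on_output(node):
    """Sum, over all paths node -> Output, of the product of edge weights."""
    if node == "Output":
        return 1.0
    return sum(
        weight * influence_on_output(child)
        for child, weight in edges.get(node, {}).items()
    )


for name in ["Input", "Feature 1", "Feature 2", "Feature 3", "Feature 4"]:
    print(f"{name}: {influence_on_output(name):.3f}")
```

Feature 3 reaches the output along two paths (via Feature 1 and via Feature 4), so its total influence is the sum of both path products.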

Tools:

- Neuronpedia: Interactive feature visualization
- Gemma Scope: Pre-trained SAEs for Gemma models
- TransformerLens: Hook-based activation extraction

Practical Workflow (5 steps)¶

1. EXTRACT activations
   model.run_with_cache(tokens) -> cache['blocks.0.attn.hook_q']

2. TRAIN SAE
   SAE(input_dim=hidden_size, hidden_dim=hidden_size * 8)

3. ANALYZE features
   - Which features activate on specific inputs?
   - What concepts do features represent?

4. TRACE circuits
   - Which features connect to which?
   - Patch to verify causal relationships

5. VERIFY interpretations
   - Do discovered features match human concepts?
   - Can we predict behavior from features?
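
Step 3 can be sketched with plain array operations. The tokens and the activation matrix below are hypothetical. Real values would come from running a trained SAE on cached model activations, and the helper names are illustrative.

```python
import numpy as np

# Hypothetical data: tokens of a prompt and SAE feature activations for each
# token (rows = tokens, columns = features).
tokens = ["The", "Eiffel", "Tower", "is", "in", "Paris"]
feature_acts = np.array([
    [0.0, 0.1, 0.0],
    [2.3, 0.0, 0.0],
    [1.9, 0.0, 0.2],
    [0.0, 0.0, 0.0],
    [0.0, 0.4, 0.0],
    [0.7, 0.0, 3.1],
])


def top_tokens(feature_idx, k=2):
    """Return the k tokens on which a feature activates most strongly."""
    order = np.argsort(feature_acts[:, feature_idx])[::-1][:k]
    return [(tokens[i], float(feature_acts[i, feature_idx])) for i in order]


def active_features(token_idx, threshold=0.5):
    """Return indices of features above `threshold` on one token."""
    return np.flatnonzero(feature_acts[token_idx] > threshold).tolist()


print(top_tokens(0))       # tokens that most excite feature 0
print(active_features(5))  # features firing on "Paris"
```

Reading the top-activating tokens for a feature gives a first hypothesis about what it represents. Step 5 then tests that hypothesis on held-out inputs rather than taking it at face value.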

Interview questions¶

Q: What is the difference between a feature and a neuron?

A: A neuron is a physical unit of the network. A feature is a conceptual unit of information. Because of polysemanticity, one neuron can represent several unrelated features. Sparse Autoencoders make it possible to "untangle" them into interpretable features.

Q: What is superposition and why does it matter?

A: Superposition is when a network represents more features than it has neurons by encoding them in combinations of activations. This lets the model store more information but makes interpretation hard. SAEs help "decode" superposition.

Q: How does activation patching work?

A: A forward pass with input A saves the activations. A second forward pass runs input B with patching: the activations at a chosen layer are replaced with the values from A. If the output changes, that layer is critical for the difference between A and B. This is a causal intervention, not just a correlation.

Q: Why use a sparse autoencoder instead of a regular one?

A: A regular AE gives a dense representation: all features are active at once, which is not interpretable. An SAE with L1 regularization gives a sparse representation: only a few features are active for each input, and each feature ideally represents an interpretable concept.

Q: What is a circuit in the context of interpretability?

A: A circuit is a path through the network from input features to output features via intermediate features. ACDC (Automated Circuit Discovery) automatically finds the minimal set of connections needed for a specific model behavior.

Production Tools (2026)¶

Tool	Purpose	Status
TransformerLens	Activation extraction + hooks	Open source
Neuronpedia	Feature visualization	Web app
Gemma Scope	Pre-trained SAEs for Gemma	Open source
Anthropic's Circuit Tracing	Attribution graphs	Research
OpenAI's Microscope	Feature visualizations	Deprecated

2026 Directions¶

Active Research:

- Scalable SAEs: training SAEs for 100B+ parameter models
- Automated Interpretability: LLMs interpreting other LLMs
- Safety Applications: using interpretability for alignment
- Cross-Model Features: do similar features exist across models?