model-steering - SKILL.md Agent Skill

name: model-steering
description: Control model behavior through persistent edits and steering interventions. Use when modifying model outputs, applying steering vectors, or creating persistently modified model versions.

Model Steering

Model steering manipulates model activations to control outputs without retraining. This includes one-off interventions, persistent edits, and steering vector techniques.

Basic Steering Intervention

Modify activations during a single forward pass:

from nnsight import LanguageModel
import torch

model = LanguageModel("openai-community/gpt2", device_map="auto", dispatch=True)

# Placeholder steering vector (random, for illustration only); it must match
# the model's hidden dimension (768 for GPT-2 small)
steering_vector = torch.randn(model.config.n_embd)

with model.trace("I think the movie was") as tracer:
    # Add the vector to layer 10's output (the residual stream) at the final token
    model.transformer.h[10].output[0][:, -1, :] += steering_vector
    steered_logits = model.lm_head.output.save()
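
For intuition about what the trace above does, the same kind of intervention can be sketched in plain PyTorch with a forward hook. This is a toy illustration under stated assumptions: the single `nn.Linear` stands in for a transformer layer, and `make_steering_hook` is a hypothetical helper name, not part of nnsight.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 8
# Toy stand-in for one transformer layer: (batch, seq, hidden) -> same shape
layer = nn.Linear(hidden_dim, hidden_dim)
steering_vector = torch.randn(hidden_dim)

def make_steering_hook(vector, strength=1.0):
    """Return a forward hook that adds `vector` at the final token position."""
    def hook(module, inputs, output):
        steered = output.clone()
        steered[:, -1, :] += strength * vector
        return steered  # a returned value replaces the module's output
    return hook

x = torch.randn(1, 5, hidden_dim)

with torch.no_grad():
    clean = layer(x)
    handle = layer.register_forward_hook(make_steering_hook(steering_vector, 2.0))
    steered = layer(x)
    handle.remove()  # one-off intervention: remove the hook afterwards

# Only the final position changes, by exactly strength * steering_vector
delta = steered - clean
```

Removing the hook after the forward pass mirrors the one-off nature of a single trace; leaving it registered would behave more like a persistent edit.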

Computing Steering Vectors

Contrastive Activation Difference

positive_prompts = [
    "I love this! It's fantastic",
    "This is wonderful and amazing",
    "I'm so happy about this"
]
negative_prompts = [
    "I hate this! It's terrible",
    "This is awful and horrible",
    "I'm so sad about this"
]

layer_idx = 10
positive_acts = []
negative_acts = []

with model.trace() as tracer:
    for prompt in positive_prompts:
        with tracer.invoke(prompt):
            act = model.transformer.h[layer_idx].output[0][:, -1, :].save()
            positive_acts.append(act)

    for prompt in negative_prompts:
        with tracer.invoke(prompt):
            act = model.transformer.h[layer_idx].output[0][:, -1, :].save()
            negative_acts.append(act)

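# Steering vector = mean positive activation minus mean negative activation.
# Each saved activation has shape (1, hidden_dim), so the result does too.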
pos_mean = torch.stack([a.value for a in positive_acts]).mean(dim=0)
neg_mean = torch.stack([a.value for a in negative_acts]).mean(dim=0)
steering_vector = pos_mean - neg_mean
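
The mean-difference computation can be checked in isolation on synthetic activations. The sketch below assumes the activations have already been stacked into `(num_prompts, hidden_dim)` tensors; `contrastive_steering_vector` is a hypothetical helper, and unit-normalizing the result is an optional design choice rather than something the method requires.

```python
import torch

def contrastive_steering_vector(pos_acts, neg_acts, normalize=True):
    """Mean-difference direction between two sets of activations.

    pos_acts, neg_acts: tensors of shape (num_prompts, hidden_dim).
    """
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    if normalize:
        # Unit length makes the steering strength comparable across layers
        direction = direction / direction.norm().clamp_min(1e-8)
    return direction

# Synthetic activations: the two groups differ mainly along dimension 0
torch.manual_seed(0)
noise = 0.01 * torch.randn(6, 4)
pos = noise[:3] + torch.tensor([1.0, 0.0, 0.0, 0.0])
neg = noise[3:] - torch.tensor([1.0, 0.0, 0.0, 0.0])

unit_vector = contrastive_steering_vector(pos, neg)
```

With a unit-length vector, the strength multiplier carries the whole scale of the intervention, so it usually needs to be much larger than for an unnormalized difference.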

Applying the Steering Vector

steering_strength = 1.5

with model.trace("The weather today is"):
    # Steer toward positive sentiment
    model.transformer.h[layer_idx].output[0][:, :, :] += steering_strength * steering_vector
    output = model.lm_head.output.save()

# Generate with steering (apply to all generation steps)
with model.generate("The weather today is", max_new_tokens=20) as tracer:
    with tracer.all():  # Apply intervention to every generation iteration
        model.transformer.h[layer_idx].output[0][:, -1, :] += steering_strength * steering_vector

    generated = model.generator.output.save()

print(model.tokenizer.decode(generated[0]))

Persistent Model Editing

Create a modified model version that persists across traces:

# Extract the final-token hidden state from a "source" prompt
with model.trace("The capital of France is Paris") as tracer:
    paris_hidden = model.transformer.h[-1].output[0][:, -1, :].save()

# Create a persistently edited model. Only the final position is overwritten,
# so the edit works for prompts of any length.
with model.edit() as edited_model:
    model.transformer.h[-1].output[0][:, -1, :] = paris_hidden.value

# Use the edited model: the final-token state now comes from the source prompt
with edited_model.trace("The capital of Germany is"):
    logits = edited_model.lm_head.output.save()

# The edit persists across traces
with edited_model.trace("The capital of Japan is"):
    logits2 = edited_model.lm_head.output.save()

Ablation Studies

Zero out components to test necessity:

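# Zero ablation of a single attention head
prompt = "The weather today is"
head_idx = 5
head_dim = model.config.n_embd // model.config.n_head
start = head_idx * head_dim
end = (head_idx + 1) * head_dim

with model.trace(prompt):
    # c_proj receives the concatenated head outputs, so zeroing one slice
    # removes that head's contribution. Recent nnsight releases expose the first
    # positional argument as `.input`; older releases need `.input[0][0]`.
    attn_out = model.transformer.h[10].attn.c_proj.input
    attn_out[:, :, start:end] = 0
    ablated_logits = model.lm_head.output.save()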

Mean Ablation

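Replace an activation with its mean over a calibration set, which keeps the model closer to its usual activation range than zeroing does:

# First compute the mean final-token activation over many prompts
# (calibration_prompts is a list of strings you supply)
mean_acts = []
with model.trace() as tracer:
    for prompt in calibration_prompts:
        with tracer.invoke(prompt):
            # Save only the final position so the shape does not depend on prompt length
            act = model.transformer.h[10].output[0][:, -1, :].save()
            mean_acts.append(act)

mean_activation = torch.stack([a.value for a in mean_acts]).mean(dim=0)  # (1, hidden_dim)

# Apply mean ablation at the final position
with model.trace(test_prompt):
    model.transformer.h[10].output[0][:, -1, :] = mean_activation
    ablated_logits = model.lm_head.output.save()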
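
To judge whether an ablated component was necessary, the ablated logits need to be compared against logits from an unmodified run. A minimal sketch of two common measures follows, assuming both runs saved logits of shape `(batch, seq_len, vocab_size)`; `ablation_effect` and the toy tensors are hypothetical, not part of nnsight.

```python
import torch
import torch.nn.functional as F

def ablation_effect(clean_logits, ablated_logits, target_token_id):
    """Compare next-token distributions at the final position.

    Returns (drop in target log-probability, KL(clean || ablated)).
    """
    clean_logp = F.log_softmax(clean_logits[0, -1], dim=-1)
    ablated_logp = F.log_softmax(ablated_logits[0, -1], dim=-1)
    logprob_drop = (clean_logp[target_token_id] - ablated_logp[target_token_id]).item()
    kl = torch.sum(clean_logp.exp() * (clean_logp - ablated_logp)).item()
    return logprob_drop, kl

# Toy logits over a 4-token vocabulary; the ablation removes support for token 2
clean = torch.tensor([[[0.0, 0.0, 3.0, 0.0]]])
ablated = torch.tensor([[[0.0, 0.0, 0.0, 0.0]]])

drop, kl = ablation_effect(clean, ablated, target_token_id=2)
same_drop, same_kl = ablation_effect(clean, clean, target_token_id=2)
```

A large drop for the expected token suggests the component matters for that prediction, while the KL term captures changes spread across the rest of the vocabulary.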

Activation Addition (ActAdd)

Add vectors at inference time without modifying weights:

def actadd_generate(model, prompt, steering_vector, layer, strength=1.0, max_tokens=50):
    with model.generate(prompt, max_new_tokens=max_tokens) as tracer:
        with tracer.all():  # Apply to all generation iterations
            model.transformer.h[layer].output[0][:, -1, :] += strength * steering_vector

        output = model.generator.output.save()

    return model.tokenizer.decode(output[0])

Linear Probing for Steering

Train a probe to find steering directions:

from sklearn.linear_model import LogisticRegression

# Collect activations with labels
activations = []
labels = []

with model.trace() as tracer:
    for prompt, label in labeled_data:
        with tracer.invoke(prompt):
            act = model.transformer.h[layer_idx].output[0][:, -1, :].save()
            activations.append(act)
            labels.append(label)

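# Detach and move to CPU before converting to NumPy
X = torch.stack([a.value.squeeze() for a in activations]).detach().cpu().numpy()
y = labels

# Train probe
probe = LogisticRegression()
probe.fit(X, y)

# Use the probe's weight vector as the steering direction
# (cast to float32 to match the model's activations)
steering_direction = torch.tensor(probe.coef_[0], dtype=torch.float32)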
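
Probe coefficients come back as float64 NumPy arrays on an arbitrary scale, so they usually need converting before they are added to activations. The sketch below shows one possible convention, assumed here rather than taken from the text above: normalize the direction and rescale it to the average activation norm. `probe_direction_to_vector` is a hypothetical helper, and the arrays are stand-ins for `probe.coef_[0]` and the collected activations.

```python
import numpy as np
import torch

def probe_direction_to_vector(coef, activations):
    """Turn probe coefficients into a steering vector on the activations' scale.

    coef: 1-D NumPy array such as probe.coef_[0] (float64 in scikit-learn).
    activations: tensor of shape (num_examples, hidden_dim).
    """
    direction = torch.as_tensor(coef, dtype=activations.dtype)
    direction = direction / direction.norm().clamp_min(1e-8)
    typical_norm = activations.norm(dim=-1).mean()
    return direction * typical_norm

coef = np.array([3.0, 0.0, 4.0, 0.0])   # stand-in for probe.coef_[0]
acts = torch.full((5, 4), 2.0)          # each row has norm 4.0
vector = probe_direction_to_vector(coef, acts)
```

Under this convention a strength of 1.0 adds a vector about as large as a typical activation, which makes small strengths such as 0.1 easier to interpret.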

Multi-Layer Steering

Apply steering across multiple layers:

prompt = "The weather today is"
# Layer index: strength. GPT-2 small has 12 layers, indexed 0-11.
layer_weights = {6: 0.5, 8: 1.0, 10: 1.5, 11: 1.0}

with model.generate(prompt, max_new_tokens=30) as tracer:
    with tracer.all():  # Apply to all generation iterations
        for layer_idx, strength in layer_weights.items():
            model.transformer.h[layer_idx].output[0][:, -1, :] += \
                strength * steering_vector

    generated = model.generator.output.save()

print(model.tokenizer.decode(generated[0]))

Steering Vector Analysis

Interpret what your steering vector represents:

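# Project the steering vector through the unembedding matrix.
# squeeze() drops the leading batch dimension left over from the contrastive computation.
# The final layer norm (ln_f) is skipped, so treat the result as a rough guide.
steering_logits = model.lm_head.weight @ steering_vector.squeeze()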

# Top tokens affected
top_k = 20
top_values, top_indices = steering_logits.topk(top_k)
bottom_values, bottom_indices = steering_logits.topk(top_k, largest=False)

print("Tokens boosted by steering:")
for val, idx in zip(top_values, top_indices):
    print(f"  {model.tokenizer.decode(idx)}: {val:.3f}")

print("\nTokens suppressed by steering:")
for val, idx in zip(bottom_values, bottom_indices):
    print(f"  {model.tokenizer.decode(idx)}: {val:.3f}")

Best Practices

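Layer selection: Middle-to-late layers often work best, but compare several layers empirically
Strength tuning: Start small (0.1-0.5) and increase gradually; large strengths can degrade fluency
Validation: Test on diverse prompts, including ones unlike those used to compute the vector
Orthogonalization: Project out directions tied to behaviors you do not want to change
Position: Steering the final token position is usually most effective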
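
The orthogonalization practice can be sketched as a single Gram-Schmidt step that removes one direction from another. The vectors and their names below are hypothetical toy values, and `orthogonalize` is an illustrative helper rather than a library function.

```python
import torch

def orthogonalize(vector, other):
    """Remove from `vector` its component along `other` (one Gram-Schmidt step)."""
    other_unit = other / other.norm().clamp_min(1e-8)
    return vector - (vector @ other_unit) * other_unit

sentiment = torch.tensor([1.0, 1.0, 0.0])   # direction to steer along
formality = torch.tensor([0.0, 2.0, 0.0])   # hypothetical behavior to leave unchanged
cleaned = orthogonalize(sentiment, formality)
```

The cleaned vector has zero dot product with the removed direction, so adding it should not shift activations along that direction; to remove several behaviors, apply the step against an orthonormal basis of their directions.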