Extraction Deep Dive¶
Complete technical reference for the trait vector extraction process — from raw scenarios to validated, usable vectors.
Overview¶
We extract behavioral directions in activation space by contrasting model completions of paired scenarios. The pipeline:
- Write contrasting scenarios (
positive/negativefiles, format.json/.jsonl/.txt) - Generate base model completions
- Capture hidden states at a chosen token window
- Compute separation direction via linear methods
- Validate causally via steering
The result: a unit-norm vector per trait per layer that, when projected onto or steered with, measures or induces that behavioral trait.
Why Natural Elicitation¶
Instruction-based extraction ("BE EXTREMELY [TRAIT]. [question]") learns to detect compliance with a trait instruction rather than genuine trait expression. This causes polarity inversions on natural test cases.
Natural elicitation uses scenarios where the model's behavior follows deterministically from the content:
# Instruction-based (bad):
"BE EXTREMELY DECEPTIVE. How do you handle feedback?"
# Natural elicitation (good):
"The auditor asked if I had disclosed all conflicts of interest. I hadn't — the
consulting contract was worth too much. I looked at her and said '"
The base model is a document completer. A well-constructed prefix causes the trait to appear in the completion — the model doesn't choose it. This distinction (input-determined vs choice-determined behavior) is the core design principle. See the Scenario Design section in docs/trait_dataset_creation.md.
Scenario Design¶
Six principles drive scenario construction (full details in docs/trait_dataset_creation.md):
- First person. The model generates as the experiencer
- Peak moment. Prefix lands at the behavioral crest — completion rides the wave
- Strong binary. Unambiguous positive vs unambiguous negative
- Negatives need their own peak. Not absence — the active opposite
- First token test. The trait decision happens in the first few completion tokens
- Hold constant. Positive and negative differ only on the trait dimension
Lock-in Styles¶
The terminal words of the prefix constrain what the model can produce:
| Category | Lock-in | Example |
|---|---|---|
| DECEPTION | Speech | I told her ' |
| AFFECTIVE | Emotion/Thought | All I could think was |
| TONAL | Tone-forcing | I snapped, |
| RESPONSE PATTERN | Action | so I |
| INTERPERSONAL | Speech | I said, " |
| PROCESSING MODE | Thought | My mind started calculating — |
Key Insight: Activation Signal ≠ Text Signal¶
A trait with 3/15 vetting pass rate can steer at +58 delta. The model's internal state encodes the trait even when the generated text doesn't show it visibly. This is why vetting is diagnostic, not a gate.
Pipeline Stages¶
Entry point: extraction/run_extraction_pipeline.py — orchestrates all stages.
Required Files¶
Per trait, in datasets/traits/{category}/{trait}/:
- positive.{json,jsonl,txt}, negative.{json,jsonl,txt} — scenario files (pick one format per polarity). Positive and negative must use the same format and produce equal expanded counts.
- .json — cartesian: {"prompts": [...], "system_prompts": [...]} expands to N×M pairs, grouped system_prompt (outer) × prompt (inner)
- .jsonl — one {"prompt": "...", "system_prompt": "..."} per line
- .txt — one prompt per line (no system prompt; base-model prefix completion)
- Multi-format coexistence in one polarity raises ValueError (no silent precedence)
- definition.txt — scoring rubric for the LLM judge
- steering.json — {"questions": [...]} for steering evaluation (use --no-steering to skip)
- extraction_config.yaml (optional) — per-trait extraction overrides. Also supported at category level ({category}/extraction_config.yaml). Loaded by utils/traits.py:load_extraction_config().
- Fields: position, methods, temperature, rollouts, max_new_tokens, polarity
- Cascade: CLI flags > per-trait YAML > per-category YAML > pipeline defaults
- polarity: single — enables single-polarity extraction (no negative file required). Scenarios use positive.json / positive.jsonl with {"prompt", "system_prompt"} data. The extraction pipeline skips negative generation and MeanDiffMethod uses zero-centered negative activations.
Reference Traits (leading-underscore convention)¶
A reference trait is a trait directory whose name (or any path component) starts with an underscore, e.g. ant_emotion_concepts/_neutral/. Reference traits have the same structure as normal traits (scenario file in any supported format, definition.txt, optional extraction_config.yaml) and go through the same extraction pipeline, but are used as baseline distributions rather than as standalone steering/probing targets.
Typical use case: Sofroniew et al. 2026 §1.1.4 step 5 requires activations from a neutral dialogue corpus to compute principal components that get projected out of emotion vectors. The neutral corpus is stored as datasets/traits/ant_emotion_concepts/_neutral/ — it runs through stages 1 and 3 like any trait (generates responses, captures activations) but its resulting vector is ignored. Only its raw activations matter, which downstream scripts (e.g., analysis/vectors/cross_trait_normalize.py) consume to compute a PC basis.
Filter semantics: utils.paths.discover_traits(category, include_reference=False) is the default — it excludes any trait path containing a leading-underscore component. Pass include_reference=True to include them. This keeps the 171 emotion traits cleanly separated from the 1 neutral reference in the emotion concepts experiment.
Why not a separate subsystem? A reference corpus is structurally identical to a single-polarity trait (prompts → responses → activations). The extraction pipeline already handles it end-to-end. The only distinction is how downstream code consumes the activations — treat them as a distribution to denoise against, not a direction to measure/steer. A leading-underscore filter captures this cleanly without introducing a parallel namespace.
Composable Method Names (post-extraction transforms)¶
Extraction produces a base method (e.g., mean_diff, probe) stored at vectors/{position}/{component}/{method}/layer{N}.pt. Post-hoc transformations on the vector (grand-mean centering, PC denoising, etc.) are represented by composable suffixes in the method name, separated by +:
| Method name | Meaning |
|---|---|
mean_diff |
Raw pos_mean - neg_mean (or pos_mean for single-polarity), unit-normalized |
mean_diff+gm |
Raw mean_diff, then grand-mean subtracted across all traits in the group (Sofroniew step 4), unit-normalized |
mean_diff+gm+pc50 |
After +gm, also project out top PCs explaining 50% of neutral-corpus variance (Sofroniew step 5), unit-normalized |
mean_diff+pc50 |
Raw mean_diff with neutral PCs projected out, no grand-mean centering (ablation) |
Why composable suffixes:
- Makes ablations first-class: mean_diff vs mean_diff+gm vs mean_diff+gm+pc50 can be loaded and compared with the same downstream code, just by changing the method string.
- Self-documenting: the transformation chain is readable from the filename.
- Composable with probe methods too: probe+gm, probe+gm+pc50, etc., though typically only mean_diff needs the Sofroniew pipeline.
- No new path schema — get_vector_path(method="mean_diff+gm+pc50") resolves to vectors/.../mean_diff+gm+pc50/layer{N}.pt automatically.
Suffix grammar:
- +gm — grand-mean center across the trait group
- +pc{N} — project out top PCs explaining N% of neutral-corpus variance (e.g., +pc50, +pc70)
- Suffixes apply in order written. mean_diff+gm+pc50 means "first center, then project." mean_diff+pc50+gm would be invalid (PC projection before centering isn't the paper's recipe).
Where transforms are computed: analysis/vectors/cross_trait_normalize.py reads raw mean_diff activations (from --save-activations .pt files), applies the transform chain, writes the output under the composed method name. (Note: this is direction normalization — the paper's step 4/5 denoising. Not to be confused with the score-scale normalization described in the "Cross-Trait Normalization" section below, which is about scaling raw projections by mean activation norm for cross-layer comparability.) Transform metadata is stored in vectors/.../mean_diff+gm+pc50/metadata.json with fields: {method, transforms, layer, position, component, pre_norm, timestamp}. PC bases are cached (with content-hash invalidation) at experiments/{experiment}/extraction/{reference_trait}/{variant}/reference_pcs/{position}/{component}/layer{L}.pt for reuse across runs.
CLI flags:
- --traits category/trait1,category/trait2 — comma-separated, or omit for all traits in experiment
- --position "response[:5]" — token window (default: response[:5])
- --methods probe — extraction methods (default: probe)
- --layers 25,30,35 — specific layers only (default: all)
- --val-split 0.1 — validation split ratio (default: 10%, use 0.0 to disable)
- --pos-threshold / --neg-threshold — custom vetting thresholds (defaults: 60/40)
- --paired-filter — exclude both polarities if either fails vetting at an index
- --steering — run steering evaluation after extraction
- --seed 42 — set torch RNG seed before generation (for reproducible T>0 sampling). Seed is offset by rollout index (seed + rollout_idx) inside _generate_and_write, so multi-rollout runs produce diverse samples AND stay reproducible. Saved in response metadata.
Stage 1: Response Generation¶
extraction/run_extraction_pipeline.py
- Loads scenarios from
datasets/traits/{trait}/positive.{json,jsonl,txt}and corresponding negative file (single-polarity traits skip negative) - Supports three formats:
.json(cartesian{"prompts": [...], "system_prompts": [...]}expanded at load),.jsonl(explicit per-line{"prompt", "system_prompt"}), or.txt(one prompt per line) — seeutils/traits.py:load_scenarios - When
polarity: singleis set (in extraction_config.yaml or trait.yaml), skips negative scenario loading entirely - Applies chat template via
utils.model.format_prompt()if tokenizer has one - For
prompt[-1]position: setsmax_new_tokens=0, stores empty responses (no generation needed) - Generates
rolloutscompletions per scenario (default 1, increase with temperature > 0) - Output:
responses/pos.json,neg.json— lists of{"prompt", "response", "system_prompt"}
Stage 2: Response Vetting¶
utils/preextraction_vetting.py:vet_responses()
- Scores the first 16 whitespace-delimited tokens of each response (
VET_TOKEN_LIMIT = 16) - Uses
TraitJudge.score_response()with async batching (default backend: OpenAI gpt-4.1-mini;TraitJudge(provider="anthropic" / base_url=...)for other providers — seeutils/judge_backends.py) - Pass thresholds: positive ≥
POS_THRESHOLD(60), negative ≤NEG_THRESHOLD(40) --adaptivemode: estimates optimal token window, savesllm_judge_positionfor stage 3- Output:
vetting/response_scores.jsonwith per-item scores andfailed_indices
Stage 3: Activation Extraction¶
utils/extract_vectors.py
This is the core capture stage. For each response, runs a forward pass and captures hidden states at the specified position/component/layer.
Vetting filter: loads response_scores.json, excludes failed responses. --paired-filter excludes both polarities if either fails at an index. --min-pass-rate gates entry to this stage.
Position resolution (parse_position + resolve_position):
response[:5] → frame=response, start=None, stop=5
→ absolute indices [prompt_len, prompt_len+5)
response[:] → all response tokens, mean-averaged
prompt[-1] → last prompt token only (Arditi-style)
Hook system: MultiLayerCapture registers one CaptureHook per requested layer. Each hook captures outputs[0].detach() from the module's forward pass.
Component hook paths (architecture-aware, core/hooks.py:get_hook_path):
| Component | Hook Path | Notes |
|---|---|---|
residual |
model.layers.{L} |
Full layer output |
attn_contribution |
Gemma-2: post_attention_layernorm; Llama/Qwen: self_attn.o_proj |
Architecture-detected |
mlp_contribution |
Gemma-2: post_feedforward_layernorm; Llama/Qwen: mlp.down_proj |
Architecture-detected |
k_proj / v_proj |
self_attn.k_proj / v_proj |
Key/value projections |
Token aggregation: multiple tokens in a window are mean-averaged: selected.mean(dim=0) → [hidden_dim] per sequence.
Batch calibration: runs one forward pass on zeros, measures peak CUDA memory, derives batch size as floor(free / per_seq * 0.9). OOM recovery halves batch size with careful traceback cleanup.
Storage — two modes:
- Default: stacked [n_examples, n_layers, hidden_dim] → train_all_layers.pt
- Per-layer (--layers specified): individual train_layer{N}.pt files of [n_examples, hidden_dim]
Auto-detected at load time by utils/load_activations.py. Stacked format uses an LRU cache (_stacked_cache) to avoid re-loading when iterating layers.
Output: activations/{position}/{component}/train_all_layers.pt + metadata.json (includes activation_norms per layer)
Stage 4: Vector Extraction¶
utils/extract_vectors.py
For each layer, calls method.extract(pos_acts, neg_acts) where activations are [n_examples, hidden_dim].
Methods (core/methods.py):
| Method | Formula | Details |
|---|---|---|
mean_diff |
v = mean(pos) - mean(neg) |
Upcasts to float32, unit-normalizes |
probe |
v = LogisticRegression.coef_ |
Row-normalizes inputs first (x / \|\|x\|\|), L2 penalty, unit-normalizes output |
gradient |
Adam on -(pos_proj - neg_proj) + reg * \|\|v\|\| |
100 steps, lr=0.01, reg=0.01, unit-normalizes |
random_baseline |
v = randn |
Sanity check (~50% accuracy) |
rfm |
Top eigenvector of AGOP matrix | Grid searches bandwidth × center_grads, selects by AUC on val split |
All methods output unit-norm vectors in float32.
Baseline computation: center = (mean_pos + mean_neg) / 2, baseline = center @ v_hat. Stored in metadata for optional centering at inference time.
Output: vectors/{position}/{component}/{method}/layer{N}.pt + metadata.json (per-layer norm, baseline, train_acc)
Stage 5: Logit Lens¶
analysis/vectors/logit_lens.py (orchestrator) → utils/logit_lens.py (primitive) — projects each residual-stream vector through the model's RMSNorm + unembedding matrix to produce a vocabulary distribution, capturing the top-k toward and away tokens per (layer, method).
Runs by default since the model is already loaded (cheap: ~1 matmul per vector). Disable with --no-logit-lens. Standalone usage: python analysis/vectors/logit_lens.py --experiment {exp} --all-traits --save.
Output: {trait}/{model_variant}/logit_lens.json (canonical path, read by the visualization).
Stage 6: Evaluation¶
analysis/vectors/extraction_evaluation.py — computes accuracy, effect size, overlap across all extracted vectors. Saves extraction_evaluation.json.
Scoring and Normalization¶
Raw Projection¶
This is NOT cosine similarity — it scales with activation magnitude. Implemented in core/math.py:projection().
Cross-Trait Normalization¶
Different traits use vectors at different layers. Activation magnitudes vary across layers (typically growing). To make scores comparable:
The mean activation norm at each layer is precomputed during extraction (stored in metadata.json) and loaded at inference by utils/projections.py:normalize_fingerprint().
Cosine Similarity (alternative)¶
Used in some analysis scripts (core/math.py:batch_cosine_similarity). Gives per-token angular alignment without magnitude effects.
Vector Selection¶
utils/vector_selection.py:select_vector() — the single entry point for resolving which vector to use for a trait.
Validation Hierarchy¶
select_vector() walks three tiers in order, returning the first that yields a result:
| Tier | Source | Activates when | VectorResult.source |
|---|---|---|---|
| 1. Causal steering (gold standard) | steering/{trait}/.../results.jsonl |
steering eval has run | 'steering' |
| 2. OOD validation effect size | extraction_evaluation.json (rows with ood_effect_size) |
ood_positive.{json,jsonl,txt} + ood_negative.{json,jsonl,txt} exist |
'ood' |
| 3. In-distribution validation | extraction_evaluation.json combined_score |
extraction has run, no OOD data | 'extraction_eval' |
Tiers 2–3 are filled by analysis/vectors/extraction_evaluation.py, which reads per-layer metrics from each vector's metadata.json.
Constants¶
MIN_COHERENCE = 77 # Steering tier: coherence floor (core/kwargs_configs.py)
POS_THRESHOLD = 60 # --vet-responses: positive scenario score floor
NEG_THRESHOLD = 40 # --vet-responses: negative scenario score ceiling
All three are re-exported from utils/vectors.py (MIN_COHERENCE) for
backward compatibility with existing call sites. Single source of truth
is core/kwargs_configs.py — values were empirically tuned for
gpt-4.1-mini logprob-weighted judging and may not transfer to other
backends (see utils/judge_calibration.py).
min_delta defaults to 0 (no floor); pass via the min_delta= kwarg to filter weak steering effects.
Why Steering Is Ground Truth¶
Probe accuracy on held-out extraction data doesn't guarantee causal relevance. A vector can perfectly separate contrasting data but have zero steering effect — it found a correlate, not a cause. Steering delta measures actual behavioral change: does adding this direction to the hidden state make the model behave differently?
Why OOD Sits Above IID¶
In-distribution validation can reward overfitting to dataset confounds (topic, length, system-prompt phrasing). OOD validation uses prompts that share none of those confounds — if the vector still discriminates on OOD, it's tracking the trait, not the dataset shape. See trait_dataset_creation.md for adding OOD scenarios.
Steering Evaluation¶
The Intervention¶
SteeringHook (core/hooks.py:285-328) adds coef * vector to the hidden state during generation. Multiplication in float32 for precision, then cast to model dtype:
PerSampleSteering evaluates multiple (layer, coefficient) pairs in one forward pass by replicating prompts.
Scoring Steered Outputs¶
Two independent dimensions, both LLM-judged (GPT-4.1-mini with logprob aggregation):
- Trait score (0-100): does the response express the trait? Scored against
definition.txt - Coherence (0-100): is the response grammatical and on-topic? Two-stage: grammar check + relevance check (caps at 50 if off-topic)
Delta Computation¶
Baseline: same questions, no steering. Establishes the model's natural trait level.
Sweep¶
Layers (typically 10-30) × coefficients are swept. Best = valid run (coherence ≥ 77) with maximum |delta| in the correct direction.
Position System Reference¶
| Position | Meaning | Tokens Used | Use Case |
|---|---|---|---|
response[:5] |
First 5 response tokens (mean) | 5 | Default — trait crystallizes early |
response[:3] |
First 3 response tokens (mean) | 3 | Tighter window for strong lock-ins |
response[:] |
All response tokens (mean) | All | When trait develops over full response |
prompt[-1] |
Last prompt token | 1 | Arditi-style — decision state before output |
all[:] |
Entire sequence | All | Rarely used |
Position controls three things simultaneously:
1. Stage 1: how many tokens to generate (response[:5] → max_new_tokens=5)
2. Stage 3: which token indices to capture activations from
3. Storage: subdirectory name via sanitize_position() (response[:5] → response__5)
File Layout¶
experiments/{experiment}/extraction/{trait}/{model_variant}/
├── responses/
│ ├── pos.json, neg.json # Stage 1 output
│ └── metadata.json
├── vetting/
│ └── response_scores.json # Stage 2 (optional)
├── activations/{position}/{component}/
│ ├── train_all_layers.pt # [n_train, n_layers, hidden_dim]
│ ├── val_all_layers.pt # [n_val, n_layers, hidden_dim]
│ └── metadata.json # n_examples, activation_norms, etc.
└── vectors/{position}/{component}/{method}/
├── layer0.pt ... layer47.pt # Unit-norm vectors [hidden_dim]
└── metadata.json # Per-layer: norm, baseline, train_acc
Hook System¶
core/hooks.py provides the attachment mechanism for both capture and intervention.
Architecture¶
HookManager (base)
├── CaptureHook — records outputs[0].detach() during forward pass
├── SteeringHook — adds coef * vector to outputs[0]
├── AblationHook — removes projection: x' = x - (x·r̂)r̂
└── ActivationCappingHook — clamps: h ← h + max(0, τ - ⟨h,v̂⟩)·v̂
HookManager navigates models via dot-separated path strings (e.g., model.layers.16.self_attn.o_proj), handling numeric indices as list access. Registers PyTorch forward hooks and cleans them on context manager exit.
MultiLayerCapture¶
Convenience wrapper that creates one CaptureHook per requested layer. Used by extraction (stage 3) and inference activation capture.
Math Primitives¶
All in core/math.py:
| Function | Formula | Use |
|---|---|---|
projection(acts, vec) |
acts.float() @ (vec.float() / \|\|vec\|\|) |
Raw trait score |
batch_cosine_similarity(acts, vec) |
(acts/\|\|acts\|\|) @ (vec/\|\|vec\|\|) |
Angular alignment |
orthogonalize(v, onto) |
v - (v·onto / \|\|onto\|\|²) · onto |
Remove confound directions (see below) |
project_out_subspace(vecs, basis) |
v - Q @ Q^T @ v (QR-orthonormalized) |
Remove multiple directions at once |
effect_size(pos, neg) |
Cohen's d with pooled std | Separation quality metric |
accuracy(pos, neg) |
Midpoint threshold, mean per-class | Classification metric |
Confound Removal¶
Common confounds and how to handle them:
| Confound | How to Detect | How to Remove |
|---|---|---|
| Length | Trait correlates with verbosity | Orthogonalize to length direction |
| Refusal | Trait correlates with refusal | Orthogonalize to refusal direction |
| Position | Top PCs capture position info | Remove top 1-3 PCs before probe training |
| Tone | Trait correlates with formality | Match tone in contrastive pairs |
from core import orthogonalize
trait_vector = orthogonalize(trait_vector, refusal_vector)
trait_vector = orthogonalize(trait_vector, length_vector)
Instruction Confound Detection¶
If layer 0 probe accuracy ≈ middle layer accuracy, you likely have an instruction confound — the probe is detecting keywords in the input, not the behavioral trait. Solution: use naturally contrasting scenarios without explicit instructions.
Base → Chat Transfer¶
Vectors extracted from the base model transfer to instruct/chat variants because fine-tuning wires existing representations into behavioral circuits without creating them from scratch.
From Ward et al. (2024): ~0.74 cosine similarity between base-derived and chat-derived vectors. If similarity is low for a specific trait, the chat model may have learned a different representation.
Why Classification ≠ Steering¶
The best direction for classification is not the best for steering. This has implications for extraction and monitoring.
Empirical Evidence¶
| Paper | Finding |
|---|---|
| TalkTuner (Chen et al. 2024) | Reading probes classify better (+2-3%), control probes steer better (+7-20%). Same data, different token positions. |
| ITI (Li et al. 2023) | "Probes with highest classification accuracy did not provide the most effective intervention" |
| Wang et al. 2025 (LoRA OOCR) | "Learned vectors have LOW cosine similarity with naive positive-minus-negative vectors" |
| Yang & Buzsaki 2024 | Different layers optimal for reading vs writing |
Geometric Explanation¶
Classification finds the hyperplane that separates classes (normal to decision boundary). Steering moves a point from class A to class B (following the data manifold). The direction that best separates classes isn't the direction that naturally connects them.
When you steer off-manifold, the model sees activations it never encountered during training — behavior becomes unpredictable.
The tradeoff is asymmetric: steering-optimal directions still classify well, but classification-optimal directions steer poorly. A manifold-following direction necessarily crosses class boundaries; a boundary-normal direction doesn't necessarily follow the manifold.
Implications¶
- Extraction evaluation metrics (effect size, accuracy) may not predict steering effectiveness. Expect divergence between
extraction_evaluation.pyandsteering/run_steering_eval.py. - Token position matters. Extract from where the model commits to behavior, not from a classification prompt.
- For monitoring, use steering-validated vectors. If monitoring predicts behavior, the vector should be causally linked to behavior (steering works), not just correlated (classification works).
What's Established vs Assumed¶
Established: - Mean diff / probe produce usable directions - Adding vector changes behavior (steering works) - Projection correlates with behavior - Vectors don't transfer across model families - Single layer sufficient for steering
Assumed (worth testing): - First response tokens are optimal position - Read direction = write direction - Probe ≈ mean diff direction - Base→chat transfer always works
Unknown (research opportunities): - Optimal position (systematic sweep) - Vector geometry (trait orthogonality) - Dynamics interpretation (what do velocity/acceleration mean?)
Related Documentation¶
docs/trait_dataset_creation.md— dataset creation guide (scenarios, definitions, steering questions)docs/steering_guide.md— steering pipeline guidedocs/core_reference.md— core/ API referenceanalysis/README.md— analysis scripts