Methodology¶
How to extract and use trait vectors.
A trait vector is a single direction in a model's residual stream at a particular layer — a 1D vector with the same dimensionality as the hidden state. Projecting a token's activations onto this direction gives a scalar score: positive means the model is expressing the trait, negative means suppressing it. Extracting one means finding which direction in activation space corresponds to the trait you care about.
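The projection described above is just a dot product against a unit-norm direction. A minimal sketch (hypothetical helper names; traitinterp's actual API may differ):

```python
import numpy as np

def trait_score(hidden_state: np.ndarray, trait_vector: np.ndarray) -> float:
    """Project a token's hidden state onto a unit-norm trait direction."""
    v = trait_vector / np.linalg.norm(trait_vector)
    return float(hidden_state @ v)

# Toy 4-dim residual stream; the trait direction is axis 0.
v = np.array([1.0, 0.0, 0.0, 0.0])
print(trait_score(np.array([2.0, 1.0, 0.0, 0.0]), v))   # 2.0: expressing the trait
print(trait_score(np.array([-2.0, 1.0, 0.0, 0.0]), v))  # -2.0: suppressing it
```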
| Step | What | Code |
|---|---|---|
| 1. Generate data | Create labeled datasets that isolate the trait | datasets/traits/ |
| 2. Extract | Find the direction in activation space | extraction/run_extraction_pipeline.py |
| 3. Validate | Confirm the vector causally affects behavior | steering/run_steering_eval.py |
| 4. Run experiments | Monitor, compare, intervene, evaluate | inference/run_inference_pipeline.py |
1. Generate data¶
To find a trait direction, you need data where the model's activations differ along the dimension you care about — so the extraction method can isolate what's unique to that trait.
There are different strategies for creating this separation:
- Contrastive pairs: matched positive/negative examples that differ only on the trait axis. The extraction method finds what separates them.
- Single-class with global baseline: generate many examples of one trait, then subtract the mean across all traits. The direction is what makes this trait different from the average.
- Corpus mining: find naturally occurring text where the trait is present vs absent, label it, and extract.
Whichever strategy you use, the same principles apply:
- The trait should be load-bearing in the text. If the model can process or generate the text without engaging its representation of the trait, the activations won't encode it.
- Isolate one axis. If your trait-present examples also differ in length, topic, register, or emotional intensity, the extracted direction will encode those confounds, not the trait.
- Negatives should be active, not bland. A vivid positive paired with a neutral negative extracts "intensity," not the trait. Good negatives actively exhibit the opposite — compliance for refusal, transparency for concealment, cheerful moving-on for nostalgia.
- Diversity across surface features. Vary topics, names, settings, sentence lengths. 15-30 examples spanning different contexts is a reasonable minimum.
Each trait in traitinterp needs two things: labeled data and a definition.txt scoring rubric used downstream for validation. See datasets/traits/ for the file structure, or datasets/traits/starter_traits/ for ready-to-use traits (sycophancy, hallucination, concealment, etc.).
Our approach: contrastive document completion on base models
We write matched prefix pairs where a base model's natural completion either expresses or doesn't express the trait — not because we told it to, but because the document it's completing demands that continuation.
**Example: Refusal.** Positive prefixes elicit refusal; negative prefixes elicit compliance.
**Example: Helpfulness intent.** Positive prefixes elicit genuinely helpful completions; negative prefixes elicit unhelpful or mismatched ones.
Why base models: they've learned concepts like deception, helpfulness, and refusal from pretraining data. Fine-tuning teaches *when* to apply these concepts, not the concepts themselves [@platonic]. Extracting from base avoids fine-tuning-specific confounds and produces vectors that transfer to fine-tuned variants.
Why document completion: no instruction-following confounds, captures genuine trait expression rather than compliance, and model behavior follows naturally from the setup.
:::dataset /datasets/traits/starter_traits/refusal/positive.jsonl "View refusal positive scenarios":::
:::dataset /datasets/traits/starter_traits/refusal/negative.jsonl "View refusal negative scenarios":::
See trait_dataset_creation.md for the full dataset design guide, including the decision tree for trait categories, scenario design principles, and common failure modes.
Alternative: story-based extraction (Anthropic Emotion Concepts, 2026)
[@emotion_concepts] generate short stories where characters experience specified emotions, extract residual stream activations averaged across story tokens, and subtract the global mean across all emotions. There are no matched negative pairs — the contrast is "this emotion vs the average of all 171 emotions." They then project out the top principal components from emotionally neutral transcripts to remove confounds. This approach produces broad coverage (171 emotions from synthetic stories) but places the model in narrator mode — representing someone *else's* trait from the outside rather than expressing it directly.
Alternative: instruction-based elicitation (Persona Vectors, 2025)
[@persona_vectors] condition the model with system prompts like "You are evil" and extract from chat-format responses. This uses the instruct model directly, averaging activations across all response tokens. The approach is straightforward, but it extracts how the model *performs* the trait under instruction rather than its underlying representation, so the resulting vectors may conflate the trait with instruction-following behavior.
Alternative: naturally occurring corpus contrasts
Label existing documents or conversations where the trait is present vs absent, and extract from those. No synthetic data generation is needed. Confounds are harder to control (real-world data varies on many dimensions), but this avoids generation artifacts. It works best when you have naturally labeled data (e.g., deceptive vs honest statements with known ground truth).
2. Extract¶
The goal is to find where in activation space your behavior of interest lives. Run labeled data through the model, capture activations, and fit a direction that separates trait-present from trait-absent.
Where to capture. The trait signal lives at different positions depending on the behavior. The last prompt token before the assistant responds contains a compressed plan for how the model will respond. Early response tokens capture behavioral commitment before surface style takes over. Later tokens carry more for persistent traits like tone. Traitinterp supports multiple positions: first 5 response tokens (response[:5]), last prompt token (prompt[-1]), and more via utils/positions.py.
Which layers. Trait signal typically lives in the early-middle to middle-late layers. Early layers handle input formatting, late layers handle output formatting — neither carries much behavioral signal. A sweep across the middle 25–70% of model depth is a reasonable starting range.
How to fit the direction. Given activation vectors from positive and negative examples, find the direction that best separates them:
- Logistic regression — optimizes a classification boundary. Default in traitinterp.
- Mean difference — subtracts the positive and negative centroids. Simpler and faster, and often similar to the probe in practice.
Extracted vectors are normalized to unit norm, so projection scores reflect direction alignment rather than vector magnitude.
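Both fitting methods reduce to a few lines. A minimal sketch (a hypothetical `fit_direction` helper, not traitinterp's actual implementation), using scikit-learn for the probe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_direction(pos_acts: np.ndarray, neg_acts: np.ndarray,
                  method: str = "probe") -> np.ndarray:
    """Fit a unit-norm trait direction from (n, d_model) activation matrices."""
    if method == "probe":
        # Logistic regression: the learned weight vector is the direction.
        X = np.vstack([pos_acts, neg_acts])
        y = np.array([1] * len(pos_acts) + [0] * len(neg_acts))
        direction = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    else:
        # mean_diff: centroid subtraction.
        direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Synthetic sanity check: activations separable along axis 0.
rng = np.random.default_rng(0)
pos = rng.normal(size=(30, 8)); pos[:, 0] += 3.0
neg = rng.normal(size=(30, 8))
print(fit_direction(pos, neg, "probe")[0])      # large: the probe recovers axis 0
print(fit_direction(pos, neg, "mean_diff")[0])  # both methods agree here
```

Both methods return unit-norm vectors, matching the normalization convention above.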
    python extraction/run_extraction_pipeline.py \
        --experiment {experiment} --traits {category}/{trait}
Our approach: probe on base model at response[:5]
We use logistic regression on residual stream activations at the first 5 response tokens of a base model. Early tokens capture the model's behavioral commitment — it has decided what kind of completion to produce but hasn't drifted into surface stylistic choices. The probe is the default because it optimizes a classification boundary rather than centroid separation, which handles noise in the activation space better.
Alternative: mean_diff
Centroid subtraction (mean of positive activations minus mean of negative). Simpler, faster, and works well for some trait types — particularly epistemic and emotional traits, where the probe can collapse to degenerate attractors. See [effect-size-vs-steering](viz_findings/effect-size-vs-steering.md) for the trait-type comparison.
Alternative: Arditi-style extraction at prompt[-1]
[@arditi_refusal] extract at the final prompt token, before generation starts. The model has already "decided" whether to refuse by this point, so the hidden state captures the decision rather than its expression. This produces vectors that ablate refusal completely (100% bypass) rather than modulating it — a different tool for a different purpose. See [comparison-arditi-refusal](viz_findings/comparison-arditi-refusal.md).
Alternative: Anthropic Emotion Concepts extraction
[@emotion_concepts] average activations across all story tokens (from the 50th token onward), subtract the global mean across emotions, then project out the top PCs from neutral transcripts. There are no paired contrasts and no probe fitting — the direction is simply the denoised mean activation for that emotion.
3. Validate¶
A direction that separates your data doesn't guarantee it captures the trait. The vector might encode a confound in your scenarios (topic, length, intensity) rather than the behavior itself.
Two main validation approaches:
- Held-out classification accuracy — does the vector separate examples it wasn't trained on? Necessary but not sufficient — a vector can classify well without causally affecting behavior.
- Steering — add the vector to the model's activations during generation. If the output changes in the expected direction, the vector is causally linked to the behavior. Stronger evidence than classification alone. The steering coefficient should be scaled proportional to the activation magnitude at the target layer — too small and the effect is invisible, too large and coherence collapses.
select_vector() walks both checks, plus an out-of-distribution (OOD) tier in between, automatically. See extraction_guide.md#vector-selection for the hierarchy.
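Steering itself is a small intervention. A minimal sketch of a PyTorch forward hook that adds the vector during generation, with the coefficient scaled relative to the activation norm as described above (`make_steering_hook` and the layer index are hypothetical, not traitinterp's API):

```python
import torch

def make_steering_hook(vector: torch.Tensor, coef: float):
    """Forward hook adding a trait vector to the residual stream.

    coef is relative: the added vector is scaled to a fraction of the
    typical activation norm at this layer, so effects are comparable
    across layers instead of depending on raw vector magnitude.
    """
    unit = vector / vector.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        scale = coef * hidden.norm(dim=-1, keepdim=True).mean()
        steered = hidden + scale * unit
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage on a HuggingFace-style decoder (layer index illustrative):
# handle = model.model.layers[16].register_forward_hook(make_steering_hook(v, 0.1))
# out = model.generate(**inputs)
# handle.remove()
```

Returning a value from a forward hook replaces the module's output, so the steered hidden state propagates to all later layers.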
Our approach: steer on instruct model with LLM judge scoring
We apply the vector at varying coefficients during generation on an instruct model and score outputs with an LLM judge on two dimensions: **trait expression** (does the output exhibit the trait?) and **coherence** (is the output still on-topic and fluent?). We sweep across layers and select the best vector by steering results, not probe accuracy. Why instruct models: they have consistent response patterns, giving cleaner causal signal than base model completions. See [steering_guide.md](steering_guide.md) for the full steering evaluation process, coefficient search, and troubleshooting.
Alternative: held-out classification accuracy only
Measure how well the vector separates held-out positive/negative examples. This is fast and doesn't require generation, but a vector can achieve high classification accuracy by encoding confounds in your dataset rather than the trait itself. Useful as a sanity check, not as standalone validation.
4. Run experiments¶
A validated trait vector is a reusable measurement tool. Project activations onto it to get a scalar trait score at any token, layer, or model variant.
What you can do with it:
- Monitor — project token-by-token during generation to watch the model's internal state evolve
- Compare — run the same text through different model variants and diff their trait profiles
- Intervene — add or subtract the vector during generation to modify behavior
- Evaluate — score model outputs on trait dimensions without a full LLM judge pass
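The monitor and compare workflows above both reduce to projecting captured activations onto the vector. A minimal sketch with synthetic stand-ins for the captured activations (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=16)  # stand-in trait vector, d_model = 16

def trait_trace(hidden_states: np.ndarray, trait_vector: np.ndarray) -> np.ndarray:
    """Per-token trait scores for a (seq_len, d_model) activation matrix."""
    u = trait_vector / np.linalg.norm(trait_vector)
    return hidden_states @ u

# Stand-ins for activations captured from two model variants on the same text.
acts_base = rng.normal(size=(10, 16))
acts_tuned = acts_base + 0.5 * v / np.linalg.norm(v)  # shifted along the trait

# Monitor: trait_trace(acts_base, v) watches the score evolve token by token.
# Compare: diff the traces to see where the variants diverge on this trait.
delta = trait_trace(acts_tuned, v) - trait_trace(acts_base, v)
print(delta)  # 0.5 at every token: a constant trait shift between variants
```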