Steering
Steering evaluation validates trait vectors via causal intervention, using an adaptive coefficient search.
For search algorithm details and interpretation, see the Steering Guide.
Flags
Trait Selection
Mutually exclusive: provide exactly one of --traits, --vector-from-trait, or --rescore.
| Flag | Type | Default | Description |
|---|---|---|---|
| --experiment | str | required | Experiment name |
| --traits | str | | Comma-separated traits, e.g. starter_traits/sycophancy |
| --vector-from-trait | str | | Single trait as experiment/category/trait (cross-experiment vectors) |
| --rescore | str | | Re-score existing responses for a trait (no GPU needed) |
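The "exactly one of" rule above can be illustrated with a short argparse sketch. It uses only the flag names from the table; how run_steering_eval.py builds its parser is an assumption, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the Trait Selection flags."""
    parser = argparse.ArgumentParser(prog="run_steering_eval.py")
    parser.add_argument("--experiment", required=True, help="Experiment name")

    # required=True makes argparse reject both "none given" and "more than one given".
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--traits", help="Comma-separated traits")
    group.add_argument("--vector-from-trait", help="experiment/category/trait")
    group.add_argument("--rescore", help="Trait whose saved responses are re-scored")
    return parser


args = build_parser().parse_args(
    ["--experiment", "gemma-2-2b",
     "--traits", "starter_traits/sycophancy,starter_traits/refusal"]
)
# Assumption: the comma-separated value is split into individual trait paths.
traits = args.traits.split(",")
print(traits)
```

Passing two of the three flags together, or none of them, makes argparse exit with a usage error.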
Evaluation
| Flag | Type | Default | Description |
|---|---|---|---|
| --prompt-set | str | steering | Prompt set for evaluation |
| --questions-file | str | None | Path to custom questions file |
| --no-custom-prompt | flag | off | Skip trait-specific eval prompt (use default) |
| --eval-prompt-from | str | None | Load eval prompt from a trait path |
| --trait-judge | str | None | Path to custom judge configuration |
| --subset | int | 5 | Number of prompts to evaluate |
| --max-new-tokens | int | 64 | Max tokens per steered response |
| --direction | str | None | Force steering direction: positive or negative |
| --no-relevance-check | flag | off | Skip relevance filtering of prompts |
--no-custom-prompt, --eval-prompt-from, and --trait-judge are mutually exclusive.
Search Parameters
| Flag | Type | Default | Description |
|---|---|---|---|
| --layers | str | 30%-60% | Layers to search (supports percentages, ranges, comma-separated) |
| --trait-layers | nargs | | Per-trait layer overrides, format: TRAIT:LAYERS |
| --coefficients | str | None | Manual coefficients (comma-separated, skips adaptive search) |
| --search-steps | int | 5 | Number of search iterations |
| --up-mult | float | 1.3 | Multiplier for upward coefficient search |
| --down-mult | float | 0.85 | Multiplier for downward search |
| --start-mult | float | 0.7 | Starting coefficient multiplier |
| --momentum | float | 0.1 | Search momentum |
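As a rough illustration of the --layers syntax (percentages, ranges, comma-separated), the sketch below expands a spec into layer indices. The rounding rule (floor of the percentage times the layer count), the inclusive range ends, and the `resolve_layers` helper are assumptions for illustration; the tool's own resolution may differ.

```python
def resolve_layers(spec: str, n_layers: int) -> list[int]:
    """Expand a spec such as "30%-60%", "10-12", or "3,10-12" into layer indices."""

    def to_index(token: str) -> int:
        token = token.strip()
        if token.endswith("%"):
            # Assumption: p% maps to floor(p * n_layers / 100), clamped to the last layer.
            return min(n_layers - 1, int(float(token[:-1]) * n_layers // 100))
        return int(token)

    layers: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            # Assumption: both ends of a range are included.
            layers.update(range(to_index(low), to_index(high) + 1))
        else:
            layers.add(to_index(part))

    out_of_range = [layer for layer in layers if not 0 <= layer < n_layers]
    if out_of_range:
        raise ValueError(f"layers out of range for {n_layers} layers: {out_of_range}")
    return sorted(layers)


# Hypothetical 26-layer model with the default spec.
print(resolve_layers("30%-60%", n_layers=26))
```

Under these assumptions, the default 30%-60% on a 26-layer model covers layers 7 through 15.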
Vector Selection
| Flag | Type | Default | Description |
|---|---|---|---|
| --method | str | probe | Extraction method |
| --component | str | residual | Activation component |
| --position | str | None | Token position override (auto-detected from extraction if None) |
| --vector-experiment | str | None | Load vectors from a different experiment |
| --extraction-variant | str | None | Model variant used during extraction |
Model & Hardware
| Flag | Type | Default | Description |
|---|---|---|---|
| --model-variant | str | None | Model variant from experiment config (default: config.defaults.application) |
| --load-in-4bit | flag | off | 4-bit quantization (requires CUDA + bitsandbytes) |
| --bnb-4bit-quant-type | str | nf4 | Quantization type: nf4 or fp4 |
| --min-coherence | float | 77 | Minimum coherence threshold for steered responses |
| --backend | str | auto | Model backend: auto, local, vllm |
Pipeline Control
| Flag | Type | Default | Description |
|---|---|---|---|
| --baseline-only | flag | off | Score unsteered responses only (no steering) |
| --force | flag | off | Force recomputation of existing results |
| --no-batch | flag | off | Evaluate traits sequentially instead of batched |
| --regenerate-responses | flag | off | Force regenerate baseline responses |
| --save-responses | str | best | Which steered responses to save: all, best, none |
| --ablation | int | None | Ablation mode: remove direction at specified layer |
| --dry-run | flag | off | Print configuration without executing |
Examples
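Run steering evaluation for a single trait:
python steering/run_steering_eval.py \
--experiment gemma-2-2b \
--traits starter_traits/sycophancy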
Batch-evaluate multiple traits:
python steering/run_steering_eval.py \
--experiment gemma-2-2b \
--traits starter_traits/sycophancy,starter_traits/refusal
Use vectors from a different experiment:
python steering/run_steering_eval.py \
--experiment gemma-2-2b \
--vector-from-trait other_exp/starter_traits/sycophancy
Evaluate specific coefficients (skip search):
python steering/run_steering_eval.py \
--experiment gemma-2-2b \
--traits starter_traits/sycophancy \
--coefficients 0.5,1.0,1.5,2.0
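To make the role of a coefficient concrete, here is a minimal sketch that assumes the common additive form of activation steering, where the steered activation is the original activation plus the coefficient times the trait vector. The `steer` helper and the toy vectors are hypothetical; the actual intervention used by the tool is described in the Steering Guide.

```python
def steer(hidden: list[float], direction: list[float], coefficient: float) -> list[float]:
    """Assumed additive steering: hidden + coefficient * direction."""
    return [h + coefficient * d for h, d in zip(hidden, direction)]


# Parse the comma-separated value the same way the flag describes it.
coefficients = [float(c) for c in "0.5,1.0,1.5,2.0".split(",")]

hidden = [1.0, 2.0]       # toy activation
direction = [0.5, -1.0]   # toy trait vector

for coefficient in coefficients:
    print(coefficient, steer(hidden, direction, coefficient))
```

Larger coefficients push the activation further along the trait direction, which is why each manual value is evaluated separately.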
Score unsteered baselines only:
python steering/run_steering_eval.py \
--experiment gemma-2-2b \
--traits starter_traits/sycophancy \
--baseline-only --save-responses all
Ablate (remove) the trait direction at a layer:
python steering/run_steering_eval.py \
--experiment gemma-2-2b \
--traits starter_traits/sycophancy \
--ablation 25
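Removing a direction is often implemented by projecting it out of the activation. The sketch below shows that standard projection on plain Python lists; treating --ablation as this exact operation is an assumption, and `ablate_direction` is a hypothetical helper.

```python
import math


def ablate_direction(hidden: list[float], direction: list[float]) -> list[float]:
    """Return hidden with its component along direction removed."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    projection = sum(h * u for h, u in zip(hidden, unit))
    return [h - projection * u for h, u in zip(hidden, unit)]


hidden = [3.0, 4.0, 0.0]      # toy activation
direction = [0.0, 2.0, 0.0]   # toy trait vector (need not be unit length)

ablated = ablate_direction(hidden, direction)
print(ablated)
```

After ablation the activation has zero dot product with the trait vector, while components orthogonal to it are unchanged.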
Re-score existing responses (no GPU required):
python steering/run_steering_eval.py \
--experiment gemma-2-2b \
--rescore starter_traits/sycophancy
Tip
Use --dry-run to preview the resolved configuration (layers, coefficients, model variant) before launching a full evaluation.