Experiment Setup

Everything needed to configure and run a new experiment.


Experiment directory

Experiments live in experiments/{name}/. The directory is created automatically when you run any pipeline script -- you only need to create config.json manually beforehand.

experiments/my-experiment/
├── config.json              # You create this (required)
├── extraction/              # Created by extraction pipeline
├── inference/               # Created by inference pipeline
└── steering/                # Created by steering pipeline

config.json

The only file you create manually. Defines which model(s) to use.

Minimal example

A single model variant is enough to get started:

{
  "model_variants": {
    "my_model": { "model": "{huggingface_org}/{model_name}" }
  }
}

Full example

Multiple variants with defaults and a LoRA adapter:

{
  "defaults": {
    "extraction": "base",
    "application": "instruct"
  },
  "model_variants": {
    "base": { "model": "{huggingface_org}/{base_model}" },
    "instruct": { "model": "{huggingface_org}/{instruct_model}" },
    "finetuned": {
      "model": "{huggingface_org}/{instruct_model}",
      "lora": "{huggingface_org}/{lora_adapter}"
    }
  }
}

Fields

| Field | Required | Description |
| --- | --- | --- |
| model_variants | Yes | Map of variant name to model spec. Each spec requires a model field (HuggingFace model ID) and may include an optional lora field (HuggingFace adapter ID or local path). |
| defaults.extraction | No | Variant used for trait vector extraction (typically the base model). Falls back to the first variant in model_variants if omitted. |
| defaults.application | No | Variant used for inference and steering (typically the instruct model). Falls back to the first variant if omitted. |
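The fallback behavior can be sketched in a few lines. This is a hypothetical helper (resolve_variant is not a function from the repo), assuming the config shape shown above and that "first variant" means first in JSON order:

```python
import json

def resolve_variant(config: dict, stage: str) -> str:
    """Pick the variant for a pipeline stage ("extraction" or "application").

    Falls back to the first variant listed in model_variants when
    defaults does not name one.
    """
    default = config.get("defaults", {}).get(stage)
    if default is not None:
        return default
    # Python dicts preserve insertion order, so "first variant" is well-defined
    return next(iter(config["model_variants"]))

config = json.loads("""{
  "model_variants": {
    "base": { "model": "org/base-model" },
    "instruct": { "model": "org/instruct-model" }
  }
}""")
print(resolve_variant(config, "extraction"))  # prints "base" (no defaults block)
```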

Variant naming

Variant names are free-form strings used as directory names throughout the experiment. Choose short, descriptive names (base, instruct, rank32, etc.).

LoRA paths

The lora field accepts either a HuggingFace adapter ID (e.g., ModelOrganismsForEM/Qwen2.5-14B-Instruct_bad-medical-advice) or a local filesystem path. HuggingFace IDs are downloaded automatically; local paths must exist at runtime.


Environment variables

Copy .env.example to .env and fill in the values you need.

| Variable | Required | Description |
| --- | --- | --- |
| HF_TOKEN | For gated models | HuggingFace access token. Required for gated models (Llama, Gemma, etc.); not needed for open models like Qwen. |
| OPENAI_API_KEY | For LLM judge | Used by vetting (extraction --vet-responses), scoring (steering evaluation), and coherence checks. Not needed for basic extraction or inference. |
| R2_ACCESS_KEY_ID | For R2 sync | Cloudflare R2 access key ID for downloading/uploading experiment data. |
| R2_SECRET_ACCESS_KEY | For R2 sync | Cloudflare R2 secret access key. |
| R2_ENDPOINT | For R2 sync | R2 endpoint URL. |
| R2_BUCKET_NAME | For R2 sync | R2 bucket name (default: trait-interp-bucket). |
| EXPERIMENTS_DIR | No | Redirects all experiment I/O to a different location. Useful for storing large experiment data on a separate drive. |
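EXPERIMENTS_DIR amounts to swapping the root that experiment paths are built from. A sketch of that resolution, assuming the helper name (experiment_dir) and the default root of experiments/ from the directory layout above:

```python
import os
from pathlib import Path

def experiment_dir(name: str) -> Path:
    """Resolve an experiment's directory, honoring EXPERIMENTS_DIR.

    Falls back to the repo-local experiments/ directory when the
    variable is unset.
    """
    root = Path(os.environ.get("EXPERIMENTS_DIR", "experiments"))
    return root / name
```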

OPENAI_API_KEY

Steering evaluation will fail with a hard error if this is not set. Basic extraction (without --vet-responses) and inference do not require it.


Model config (optional)

Files in config/models/{slug}.yaml provide model architecture metadata. These are not required -- the pipeline auto-detects architecture from HuggingFace model configs. 23 model configs ship with the repo. You only need to create one if auto-detection of base vs instruct fails for your model.

Example

huggingface_id: {huggingface_org}/{model_name}
variant: instruct          # base | instruct
supports_system_prompt: true
num_hidden_layers: 32
hidden_size: 4096

Key fields

| Field | Description |
| --- | --- |
| variant | base or instruct; controls extraction defaults (position, chat template). This is the main reason to create a config. |
| supports_system_prompt | Whether the model's chat template accepts a system message. |
| num_hidden_layers | Number of transformer layers (used for layer range defaults like --layers "30%-60%"). |
| hidden_size | Residual stream dimension. |

Most other fields (model_type, num_attention_heads, intermediate_size, etc.) are auto-detected from HuggingFace and rarely need manual specification.
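To make the percentage-based layer spec concrete, here is one way a string like "30%-60%" could map onto layer indices. The parsing semantics (floor rounding, inclusive upper bound, function name layer_range) are assumptions for illustration, not the repo's actual implementation:

```python
def layer_range(spec: str, num_hidden_layers: int) -> range:
    """Turn a percentage spec like "30%-60%" into concrete layer indices.

    Percentages are taken relative to num_hidden_layers, rounded down,
    and the resulting range is inclusive on both ends.
    """
    lo_s, hi_s = spec.split("-")
    lo = int(num_hidden_layers * float(lo_s.rstrip("%")) / 100)
    hi = int(num_hidden_layers * float(hi_s.rstrip("%")) / 100)
    return range(lo, hi + 1)

# With the 32-layer example config above:
print(list(layer_range("30%-60%", 32)))  # layers 9 through 19
```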


The starter experiment

The repo ships with experiments/starter/config.json preconfigured with two models:

{
  "defaults": {
    "extraction": "instruct",
    "application": "instruct"
  },
  "model_variants": {
    "base": {
      "model": "Qwen/Qwen3.5-9B-Base"
    },
    "instruct": {
      "model": "Qwen/Qwen3.5-9B"
    }
  }
}

Use it for first-time testing:

# Extract sycophancy vectors using the starter experiment
python extraction/run_extraction_pipeline.py \
    --experiment starter \
    --traits starter_traits/sycophancy

# Run inference monitoring
python inference/run_inference_pipeline.py \
    --experiment starter \
    --prompt-set starter_prompts/general

No HF_TOKEN required

Both starter variants (base and instruct) use Qwen models, which are open and do not require a HuggingFace token. Swap in a gated model (Llama, Gemma, etc.) and you'll need HF_TOKEN set.


Next steps