# traitinterp
Extract, monitor, and steer LLM behavioral traits token-by-token during generation.
## What this does
- Extract -- Train a linear probe that detects a behavioral trait (sycophancy, deception, formality, etc.) from naturally contrasting scenarios
- Monitor -- Project hidden states onto that probe token-by-token during generation
- Steer -- Add the probe direction during inference to amplify or suppress the trait
Trait datasets are model-agnostic. Extract once, apply to any model.
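The extraction step above reduces to simple linear algebra: collect hidden states from trait-expressing and trait-absent scenarios, then take a direction separating the two. A minimal sketch with synthetic activations (the difference-of-means probe shown here is one common choice; shapes and data are illustrative, not the pipeline's actual internals):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations at one layer: 64 scenarios x 512 hidden dims.
pos_acts = rng.normal(loc=1.0, size=(64, 512))  # trait-expressing scenarios
neg_acts = rng.normal(loc=0.0, size=(64, 512))  # trait-absent scenarios

# Difference-of-means probe: unit vector pointing from "absent" to "expressing".
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Sanity check: positive scenarios should project higher on average.
pos_score = (pos_acts @ direction).mean()
neg_score = (neg_acts @ direction).mean()
```

Because the probe is defined by contrasting *scenarios* rather than model internals, the same dataset can be re-extracted against any model's activations.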
## Quick start
```bash
git clone https://github.com/ewernn/traitinterp.git && cd traitinterp
pip install -r requirements.txt
export HF_TOKEN=your_token  # for gated models
```
Extract your first trait:
```bash
python extraction/run_extraction_pipeline.py \
    --experiment starter \
    --traits starter_traits/sycophancy
```
Monitor traits during generation:
```bash
python inference/run_inference_pipeline.py \
    --experiment starter \
    --prompt-set starter_prompts/general
```
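Conceptually, monitoring and steering are two uses of the same vector: monitoring projects each token's hidden state onto it, steering adds a scaled copy of it back in. A toy sketch with synthetic hidden states (function names and the fixed coefficient are illustrative, not the pipeline's API):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 512
trait_dir = rng.normal(size=d_model)
trait_dir /= np.linalg.norm(trait_dir)  # unit-norm trait direction

def monitor(hidden_states, direction):
    """Per-token trait score: scalar projection of each hidden state."""
    return hidden_states @ direction

def steer(hidden_states, direction, coeff):
    """Shift every token's hidden state along the trait direction."""
    return hidden_states + coeff * direction

hidden = rng.normal(size=(10, d_model))  # hidden states for 10 generated tokens
scores = monitor(hidden, trait_dir)
steered = steer(hidden, trait_dir, coeff=4.0)

# With a unit-norm direction, steering shifts each projection by exactly coeff.
shift = monitor(steered, trait_dir) - scores
```

Positive coefficients amplify the trait, negative ones suppress it; the Steering guide below covers searching for a coefficient that changes behavior without degrading fluency.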
Visualize:
## Documentation
| Section | Description |
|---|---|
| Guides | |
| Extraction | Extract trait vectors from contrasting scenarios |
| Inference | Per-token monitoring, projection modes |
| Steering | Causal validation via steering, coefficient search |
| Creating Datasets | Scenario design, definitions, iteration |
| CLI Reference | |
| Extraction CLI | `run_extraction_pipeline.py` flags and usage |
| Inference CLI | `run_inference_pipeline.py` flags and usage |
| Steering CLI | `run_steering_eval.py` flags and usage |
| Analysis CLI | Analysis script flags and usage |
| Configuration | |
| Experiment Setup | `config.json`, model variants, paths |
| Trait Format | Dataset file format (`positive.txt`, `steering.json`, etc.) |
| API Reference | |
| Core API | Types, hooks, methods, math primitives |
| Response Schema | Unified response format across pipelines |
| Chat Templates | HuggingFace chat template behavior |
| Technical | |
| Architecture | Design principles, directory responsibilities, experiment schema |
| Methodology | How we extract and use vectors |