# traitinterp
Extract, monitor, and steer LLM behavioral traits token-by-token during generation.
## What this does
- Extract -- Train a linear probe that detects a behavioral trait (sycophancy, deception, formality, etc.) from naturally contrasting scenarios
- Monitor -- Project hidden states onto that probe token-by-token during generation
- Steer -- Add the probe direction during inference to amplify or suppress the trait
Trait datasets are model-agnostic. Extract once, apply to any model.
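The extraction step above reduces to simple linear algebra: collect hidden states from trait-expressing and trait-absent scenarios, then take a direction separating the two. A minimal sketch with synthetic activations (the difference-of-means probe shown here is one common choice; shapes and data are illustrative, not the pipeline's actual internals):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations at one layer: 64 scenarios x 512 hidden dims.
pos_acts = rng.normal(loc=1.0, size=(64, 512))  # trait-expressing scenarios
neg_acts = rng.normal(loc=0.0, size=(64, 512))  # trait-absent scenarios

# Difference-of-means probe: unit vector pointing from "absent" to "expressing".
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Sanity check: positive scenarios should project higher on average.
pos_score = (pos_acts @ direction).mean()
neg_score = (neg_acts @ direction).mean()
```

Because the probe is defined by contrasting *scenarios* rather than model internals, the same dataset can be re-extracted against any model's activations.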
## Quick start
```bash
git clone https://github.com/ewernn/traitinterp.git && cd traitinterp
pip install -r requirements.txt
export HF_TOKEN=your_token  # for gated models
```
Extract your first trait:
```bash
python extraction/run_extraction_pipeline.py \
    --experiment starter \
    --traits starter_traits/sycophancy
```
Monitor traits during generation:
```bash
python inference/run_inference_pipeline.py \
    --experiment starter \
    --prompt-set starter_prompts/general
```
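Conceptually, monitoring and steering are two uses of the same vector: monitoring projects each token's hidden state onto it, steering adds a scaled copy of it back in. A toy sketch with synthetic hidden states (function names and the fixed coefficient are illustrative, not the pipeline's API):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 512
trait_dir = rng.normal(size=d_model)
trait_dir /= np.linalg.norm(trait_dir)  # unit-norm trait direction

def monitor(hidden_states, direction):
    """Per-token trait score: scalar projection of each hidden state."""
    return hidden_states @ direction

def steer(hidden_states, direction, coeff):
    """Shift every token's hidden state along the trait direction."""
    return hidden_states + coeff * direction

hidden = rng.normal(size=(10, d_model))  # hidden states for 10 generated tokens
scores = monitor(hidden, trait_dir)
steered = steer(hidden, trait_dir, coeff=4.0)

# With a unit-norm direction, steering shifts each projection by exactly coeff.
shift = monitor(steered, trait_dir) - scores
```

Positive coefficients amplify the trait, negative ones suppress it; the Steering guide below covers searching for a coefficient that changes behavior without degrading fluency.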
Visualize:
## Documentation
| Section | Description |
|---|---|
| Guides | |
| Extraction | Extract trait vectors from contrasting scenarios |
| Inference | Per-token monitoring, projection modes |
| Steering | Causal validation via steering, coefficient search |
| Creating Datasets | Scenario design, definitions, iteration |
| CLI Reference | |
| Extraction CLI | `run_extraction_pipeline.py` flags and usage |
| Inference CLI | `run_inference_pipeline.py` flags and usage |
| Steering CLI | `run_steering_eval.py` flags and usage |
| Analysis CLI | Analysis script flags and usage |
| Configuration | |
| Experiment Setup | `config.json`, model variants, paths |
| Trait Format | Dataset file format (`positive.txt`, `steering.json`, etc.) |
| API Reference | |
| Core API | Types, hooks, methods, math primitives |
| Response Schema | Unified response format across pipelines |
| Chat Templates | HuggingFace chat template behavior |
| Technical | |
| Architecture | Design principles, directory responsibilities, experiment schema |
| Methodology | How we extract and use vectors |