traitinterp

Extract, monitor, and steer LLM behavioral traits token-by-token during generation.

Live demo | GitHub


What this does

  1. Extract -- Train a linear probe that detects a behavioral trait (sycophancy, deception, formality, etc.) from naturally contrasting scenarios
  2. Monitor -- Project hidden states onto that probe token-by-token during generation
  3. Steer -- Add the probe direction during inference to amplify or suppress the trait

Trait datasets are model-agnostic. Extract once, apply to any model.
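The three steps above can be sketched in a few lines. This is a toy illustration only, not the repo's actual API: the arrays stand in for hidden states, the probe is a simple difference-of-means direction, and all names (`trait_score`, `steer`, the coefficient) are illustrative assumptions.

```python
# Toy sketch of extract / monitor / steer on synthetic "activations".
# Assumes hidden states are plain numpy vectors; real pipelines pull
# these from model hooks. All names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size

# 1. Extract: difference-of-means probe from contrasting scenarios.
pos_acts = rng.normal(0.5, 1.0, (100, d))   # trait-positive activations
neg_acts = rng.normal(-0.5, 1.0, (100, d))  # trait-negative activations
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
direction /= np.linalg.norm(direction)       # unit trait direction

# 2. Monitor: project a token's hidden state onto the direction.
def trait_score(hidden_state):
    return float(hidden_state @ direction)

# 3. Steer: add the scaled direction back into the hidden state.
def steer(hidden_state, coeff=4.0):
    return hidden_state + coeff * direction

h = rng.normal(0.0, 1.0, d)
assert trait_score(steer(h)) > trait_score(h)  # steering raises the score
```

With a unit-norm direction, steering with coefficient `c` raises the projection by exactly `c`, which is what makes a coefficient search (see the Steering guide) well-behaved.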


Quick start

git clone https://github.com/ewernn/traitinterp.git && cd traitinterp
pip install -r requirements.txt
export HF_TOKEN=your_token  # for gated models

Extract your first trait:

python extraction/run_extraction_pipeline.py \
    --experiment starter \
    --traits starter_traits/sycophancy

Monitor traits during generation:

python inference/run_inference_pipeline.py \
    --experiment starter \
    --prompt-set starter_prompts/general

Visualize:

python visualization/serve.py  # http://localhost:8000

Documentation

Guides
  Extraction -- Extract trait vectors from contrasting scenarios
  Inference -- Per-token monitoring, projection modes
  Steering -- Causal validation via steering, coefficient search
  Creating Datasets -- Scenario design, definitions, iteration

CLI Reference
  Extraction CLI -- run_extraction_pipeline.py flags and usage
  Inference CLI -- run_inference_pipeline.py flags and usage
  Steering CLI -- run_steering_eval.py flags and usage
  Analysis CLI -- Analysis scripts: flags and usage

Configuration
  Experiment Setup -- config.json, model variants, paths
  Trait Format -- Dataset file format (positive.txt, steering.json, etc.)

API Reference
  Core API -- Types, hooks, methods, math primitives
  Response Schema -- Unified response format across pipelines
  Chat Templates -- HuggingFace chat template behavior

Technical
  Architecture -- Design principles, directory responsibilities, experiment schema
  Methodology -- How we extract and use vectors