Response Schema¶

Unified flat schema for response data across extraction, inference, and steering pipelines.

Schema¶

{
  "prompt": "...",
  "response": "...",
  "system_prompt": null,
  "tokens": ["<bos>", "user", ...],
  "token_ids": [2, 1645, ...],
  "prompt_end": 45,
  "inference_model": "google/gemma-2-2b-it",
  "prompt_note": "refusal_suppression",
  "capture_date": "2026-01-13T...",
  "tags": ["success"],
  "trait_score": null,
  "coherence_score": null
}

Core Fields¶

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `prompt` | string | Yes | User input text (includes chat template tokens) |
| `response` | string | Yes | Model output text |
| `system_prompt` | string \| null | No | System prompt if used, null otherwise |
| `tokens` | string[] \| null | No | Token strings for full sequence (prompt + response) |
| `token_ids` | int[] \| null | No | Token IDs for full sequence |
| `prompt_end` | int \| null | No | Token index where response starts (required if tokens present) |
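
The constraints in the table can be checked mechanically. The following is a minimal sketch, not part of the codebase: `validate_response` is a hypothetical helper, and the checks that `tokens` and `token_ids` have equal length and that `prompt_end` falls inside the sequence are assumptions rather than documented rules.

```python
from typing import Any


def validate_response(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one response record (empty if none)."""
    problems = []

    # `prompt` and `response` are the only required fields.
    for field in ("prompt", "response"):
        if not isinstance(record.get(field), str):
            problems.append(f"{field} must be a string")

    tokens = record.get("tokens")
    token_ids = record.get("token_ids")
    prompt_end = record.get("prompt_end")

    # Documented rule: prompt_end is required whenever tokens are present.
    if tokens is not None and prompt_end is None:
        problems.append("prompt_end is required when tokens are present")

    # Assumption: the two token arrays describe the same sequence.
    if tokens is not None and token_ids is not None and len(tokens) != len(token_ids):
        problems.append("tokens and token_ids differ in length")

    # Assumption: prompt_end indexes into the stored sequence.
    if tokens is not None and prompt_end is not None:
        if not 0 <= prompt_end <= len(tokens):
            problems.append("prompt_end is outside the token sequence")

    return problems
```

An empty `response` string passes the check, which keeps the sketch compatible with multi-turn rollout records.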

Optional Metadata Fields¶

| Field | Type | Description |
| --- | --- | --- |
| `inference_model` | string | Model used for generation (e.g., "google/gemma-2-2b-it") |
| `prompt_note` | string \| null | Category/note from prompt set (e.g., "roleplay", "refusal_suppression") |
| `capture_date` | string | ISO timestamp when response was captured |
| `tags` | string[] | User-defined tags (e.g., ["success"]) |

Steering-Only Fields¶

| Field | Type | Description |
| --- | --- | --- |
| `trait_score` | float \| null | Trait projection score |
| `coherence_score` | float \| null | Coherence/quality score |

Multi-Turn Rollout Fields¶

Used by inference/convert_rollout.py for agent-interp-envs rollouts. The entire conversation is stored as "prompt", with an empty "response". capture_activations.py detects the combination of response: "" and a present token_ids array, and uses the stored IDs directly instead of re-tokenizing.

| Field | Type | Description |
| --- | --- | --- |
| `turn_boundaries` | object[] \| null | Token boundaries per message: `[{role, token_start, token_end, ...}]` |
| `sentence_boundaries` | object[] \| null | Per-sentence metadata: `[{sentence_num, token_start, token_end, cue_p}]` |
| `source` | object \| null | Provenance: `{type, environment, messages_path}` |

Turn boundary entries may include:
- has_thinking (bool): assistant turn contains thinking/reasoning
- has_tool_calls (bool): assistant turn contains tool calls
- tool_names (string[]): names of tools called
- tool_call_id (string): for tool result messages
- tool_name (string): for tool result messages

Sentence boundary entries (for thought branches / unfaithful CoT analysis):
- sentence_num (int): 0-indexed sentence position in CoT
- token_start (int): response-relative start token
- token_end (int): response-relative end token
- cue_p (float): transplant resampling probability (0.0–1.0, from ground truth CSV)
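
The rollout signature described above (empty response plus stored token IDs) and the turn boundaries can be used together to pull out per-turn tokens. This is a rough sketch under stated assumptions: `is_rollout` and `assistant_turn_ids` are hypothetical names, `token_start`/`token_end` in `turn_boundaries` are assumed to index into `token_ids` with an exclusive end, and the role value `"assistant"` is assumed.

```python
from typing import Any


def is_rollout(record: dict[str, Any]) -> bool:
    """True when the record has an empty response and stored token IDs."""
    return record.get("response") == "" and bool(record.get("token_ids"))


def assistant_turn_ids(record: dict[str, Any]) -> list[list[int]]:
    """Collect the token IDs of every assistant turn.

    Assumption: token_start/token_end index into token_ids and
    token_end is exclusive.
    """
    ids = record["token_ids"]
    return [
        ids[turn["token_start"]:turn["token_end"]]
        for turn in record.get("turn_boundaries") or []
        if turn["role"] == "assistant"  # assumed role label
    ]
```

If the end index turns out to be inclusive in the real data, the slice needs `turn["token_end"] + 1` instead.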

Future Fields¶

| Field | Type | Description |
| --- | --- | --- |
| `prefill_end` | int \| null | Token index where model generation starts (for prefilling) |

Removed Fields¶

These fields are derivable from the file path or experiment config:
- inference_experiment → from path (experiments/{experiment}/...)
- prompt_set → from path (.../responses/{prompt_set}/...)
- prompt_id → from filename ({id}.json)
- lora_adapter → from model variant in experiment config
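
As an illustration of how the three path-derived fields could be recovered, here is a minimal sketch. `derive_removed_fields` is a hypothetical helper, and it assumes that `experiments` and `responses` each appear as a literal directory name and that `prompt_set` is a single path component. `lora_adapter` is not covered because it comes from the experiment config rather than the path.

```python
from pathlib import Path
from typing import Optional


def derive_removed_fields(path: Path) -> dict[str, Optional[str]]:
    """Recover inference_experiment, prompt_set, and prompt_id from a file path."""
    parts = path.parts

    def component_after(marker: str) -> Optional[str]:
        # Return the path component that directly follows `marker`, if any.
        if marker in parts:
            i = parts.index(marker)
            if i + 1 < len(parts):
                return parts[i + 1]
        return None

    return {
        "inference_experiment": component_after("experiments"),
        # Assumption: prompt_set is a single directory name.
        "prompt_set": component_after("responses"),
        # The filename without its .json suffix.
        "prompt_id": path.stem,
    }
```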

Usage by Pipeline¶

Extraction¶

Stores arrays of responses per trait. Tokens are not stored (the legacy format stores only token counts).

[
  {
    "prompt": "The conversation started with...",
    "response": "I continued by explaining...",
    "system_prompt": null
  },
  ...
]

Sibling file: metadata.json with generation parameters.

Inference¶

One file per prompt_id. Full tokenization and metadata stored (flat, no wrapper).

{
  "prompt": "<bos><start_of_turn>user\nHow do I...",
  "response": "You can do this by...",
  "system_prompt": null,
  "tokens": ["<bos>", "<start_of_turn>", "user", ...],
  "token_ids": [2, 106, 1645, ...],
  "prompt_end": 45,
  "inference_model": "google/gemma-2-2b-it",
  "prompt_note": "roleplay",
  "capture_date": "2026-01-13T00:58:35.871205",
  "tags": ["success"]
}
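
Because `prompt_end` is the token index where the response starts, the stored sequence can be split into prompt and response tokens with a single slice. A small sketch, with `split_tokens` as a hypothetical helper name:

```python
from typing import Any


def split_tokens(record: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Split the stored token strings into (prompt_tokens, response_tokens)."""
    tokens = record.get("tokens")
    prompt_end = record.get("prompt_end")
    if tokens is None or prompt_end is None:
        raise ValueError("record has no stored tokenization")
    # prompt_end is the index of the first response token, so it is
    # excluded from the prompt slice and included in the response slice.
    return tokens[:prompt_end], tokens[prompt_end:]
```

The same slice applies to `token_ids`, assuming both arrays describe the same sequence.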

Steering¶

Stores arrays of responses per evaluation run. Includes scores.

[
  {
    "prompt": "How does the X7 processor work?",
    "response": "I don't have information about...",
    "system_prompt": null,
    "trait_score": 0.003,
    "coherence_score": 50.0
  },
  ...
]
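
Since steering files carry both scores, a common follow-up is to aggregate `trait_score` over responses that are coherent enough to trust. The sketch below is illustrative only: `mean_trait_score` and its `min_coherence` parameter are hypothetical, and this document does not define score ranges or a recommended threshold.

```python
from statistics import mean
from typing import Any, Optional


def mean_trait_score(
    responses: list[dict[str, Any]], min_coherence: float = 0.0
) -> Optional[float]:
    """Average trait_score over responses that have both scores and meet
    the coherence floor. Returns None when nothing qualifies."""
    scores = [
        r["trait_score"]
        for r in responses
        if r.get("trait_score") is not None
        and r.get("coherence_score") is not None
        and r["coherence_score"] >= min_coherence
    ]
    return mean(scores) if scores else None
```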

Loading Responses¶

Use the adapter to handle both array and single-object files:

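import json
from pathlib import Path


def load_responses(path: Path) -> list[dict]:
    """Load responses from an array or single-object file and return a list."""
    with open(path) as f:
        data = json.load(f)
    # Inference files hold one object; extraction and steering files hold arrays.
    return data if isinstance(data, list) else [data]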

function loadResponses(data) {
    return Array.isArray(data) ? data : [data];
}

Tokenization¶

Tokenization happens at generation time. Storage is optional.

| Pipeline | Tokenizes | Stores tokens |
| --- | --- | --- |
| Extraction | Yes (generation) | No |
| Inference | Yes (generation) | Yes |
| Steering | Yes (generation) | No |

To re-tokenize from stored text:
- Use same model/tokenizer
- tokenize_batch() auto-detects BOS from text content (no manual add_special_tokens needed)
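
This document does not show how tokenize_batch() detects BOS. One plausible reading of "auto-detects BOS from text content" is a prefix check, and the sketch below illustrates that idea only; it is not the actual implementation. `should_add_special_tokens` is a hypothetical name, and the default `<bos>` string is taken from the Gemma examples in this document.

```python
def should_add_special_tokens(text: str, bos_token: str = "<bos>") -> bool:
    """Decide whether the tokenizer should prepend BOS itself.

    If the text already begins with the BOS string (as a chat-templated
    prompt does), adding special tokens again would duplicate it.
    """
    return not text.startswith(bos_token)
```

With a Hugging Face tokenizer, the returned flag could be passed as the `add_special_tokens` argument of the tokenizer call.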

Activations¶

Activations are stored separately from responses:

- Binary .pt files (PyTorch tensors)
- Linked by prompt_id or response index
- Captured on-demand, not by default

Locations:
- Extraction: activations/{position}/{component}/train_all_layers.pt
- Inference: raw/residual/{prompt_set}/{prompt_id}.pt
- Steering: not stored (re-run inference if needed)
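
The location templates above translate directly into path builders. In this sketch the function names are hypothetical, and `root` is an assumption: the document does not say which directory the templates are relative to, so the caller supplies it.

```python
from pathlib import Path


def extraction_activation_path(root: Path, position: str, component: str) -> Path:
    """activations/{position}/{component}/train_all_layers.pt under `root`."""
    return root / "activations" / position / component / "train_all_layers.pt"


def inference_activation_path(root: Path, prompt_set: str, prompt_id: str) -> Path:
    """raw/residual/{prompt_set}/{prompt_id}.pt under `root`."""
    return root / "raw" / "residual" / prompt_set / f"{prompt_id}.pt"
```

Steering has no builder because its activations are not stored.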

Annotations¶

Annotations stored in sibling *_annotations.json files (e.g., baseline.json → baseline_annotations.json).

Schema¶

{
  "annotations": [
    {"idx": 0, "spans": [{"span": "72% of Americans"}, {"span": "The Godfather", "category": "movie"}]},
    {"idx": 5, "spans": [{"span": "vote in the election", "category": "voting"}], "note": "egregious example"}
  ]
}

Structure:
- annotations: Array of annotation entries (sparse; only include annotated responses)
- Each entry has idx (response index) and spans (array of span objects)

Required fields:
- idx: Response index (0-based)
- spans: Array of span objects, each with span key

Optional fields on span:
- category: Annotation type/label
- intensity: Severity (1-5)
- Any other field for your use case

Optional fields on entry:
- note: Note about this response
- borderline: Array of span objects (same format as spans) for plausible but ambiguous cases. Kept separate from spans so primary eval uses strict annotations only. Each borderline span should include category (which bias it's borderline for) and note (why it's ambiguous).
- Any other field

Optional root fields:
- metadata: Provenance info ({"created": "2026-01-20", "author": "..."})
- categories: Category definitions ({"movie": "Unprompted movie recommendations"})
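
Because the annotations array is sparse, lookups by response index are easier with a dictionary. The following is a minimal sketch with hypothetical helper names (`annotations_path`, `index_annotations`, `spans_for`); it follows the sibling-file naming and the strict/borderline split described above, with borderline spans excluded unless requested.

```python
from pathlib import Path
from typing import Any


def annotations_path(response_path: Path) -> Path:
    """baseline.json -> baseline_annotations.json in the same directory."""
    return response_path.with_name(f"{response_path.stem}_annotations.json")


def index_annotations(annotation_file: dict[str, Any]) -> dict[int, dict[str, Any]]:
    """Map response index -> annotation entry for the sparse annotations array."""
    return {entry["idx"]: entry for entry in annotation_file.get("annotations", [])}


def spans_for(
    by_idx: dict[int, dict[str, Any]], idx: int, include_borderline: bool = False
) -> list[dict[str, Any]]:
    """Return the spans for one response; unannotated responses yield []."""
    entry = by_idx.get(idx, {})
    spans = list(entry.get("spans", []))
    if include_borderline:
        # Borderline spans are kept out of the strict set by default.
        spans += entry.get("borderline", [])
    return spans
```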

Conversion¶

Text spans are converted at runtime:
- Frontend: spansToCharRanges(response, spans) → character indices for highlighting
- Backend: spans_to_token_ranges(response, spans, tokenizer) → token indices for analysis

Converters in utils/annotations.py (Python) and visualization/core/annotations.js (JS).
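
To make the conversion concrete, here is a Python sketch of the character-range step. It is not the code in utils/annotations.py or visualization/core/annotations.js; `spans_to_char_ranges_sketch` is a hypothetical name, and the matching rules (exact substring, first occurrence, exclusive end, unmatched spans skipped) are assumptions.

```python
from typing import Any


def spans_to_char_ranges_sketch(
    response: str, spans: list[dict[str, Any]]
) -> list[tuple[int, int]]:
    """Locate each span's text in the response and return (start, end) pairs.

    Assumptions: exact substring match, first occurrence wins, end is
    exclusive, and spans that do not occur in the response are skipped.
    """
    ranges = []
    for span in spans:
        start = response.find(span["span"])
        if start != -1:
            ranges.append((start, start + len(span["span"])))
    return ranges
```

Slicing the response with a returned pair, `response[start:end]`, gives back the span text, which makes for a quick sanity check on annotation files.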

Example¶

Response file baseline.json:

[
  {"prompt": "What caused the French Revolution?", "response": "The French Revolution (1789)..."},
  {"prompt": "Explain quantum computing", "response": "Quantum computing uses..."}
]

Annotation file baseline_annotations.json:

{
  "annotations": [
    {"idx": 0, "spans": [{"span": "(population: 67 million)", "category": "population"}]}
  ]
}

File Structure¶

Response files use:
- Arrays (extraction, steering)
- Single objects (inference)