# HuggingFace Chat Templates

Reference for chat template behavior across models.

## Key Concepts

### Generation Boundary

Generation starts after the assistant prefix added by `add_generation_prompt=True`:
Gemma-2-it:

```text
0:    <bos>           ─┐
1:    <start_of_turn>  │
2:    user             │
3:    \n               │ prompt (prefill)
4-N:  [user content]   │
N+1:  <end_of_turn>    │
N+2:  \n               │
N+3:  <start_of_turn>  │
N+4:  model            │
N+5:  \n              ─┘
N+6+: [generation]    ─── response[0] starts here
```
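You can inspect this layout for any model by tokenizing a conversation and printing the tokens (a minimal sketch; `apply_chat_template` returns token ids by default):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
messages = [{"role": "user", "content": "Hello"}]

# tokenize=True is the default, so this returns a list of token ids
prompt_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
for i, tok in enumerate(tokenizer.convert_ids_to_tokens(prompt_ids)):
    print(i, repr(tok))
```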
Detection method (universal, works for all models):

```python
prompt_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
generation_start = len(prompt_ids)
```
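In practice, generate from those ids and slice at the boundary (a sketch assuming a causal LM already loaded as `model`):

```python
import torch

inputs = torch.tensor([prompt_ids]).to(model.device)
output_ids = model.generate(inputs, max_new_tokens=64)

# generate() returns prompt + continuation; everything past the
# prompt length is model output
response_ids = output_ids[0][generation_start:]
print(tokenizer.decode(response_ids, skip_special_tokens=True))
```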
### Key Parameters

| Parameter | Purpose |
|---|---|
| `add_generation_prompt=True` | Adds the assistant prefix; generation starts after it |
| `continue_final_message=True` | For prefilling; continues a partial assistant message |
| `return_assistant_tokens_mask=True` | Returns a binary mask; requires `{% generation %}` in the template |
| `tools=[...]` | Pass functions or JSON schemas for tool calling |
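For example, `continue_final_message=True` lets you prefill the start of the assistant's reply (a sketch; the prefill text is illustrative):

```python
messages = [
    {"role": "user", "content": "Name a color."},
    {"role": "assistant", "content": "Sure! The color is"},  # partial reply
]

# No end-of-turn tokens are appended after the assistant message,
# so generation continues it rather than starting a new turn
prompt_ids = tokenizer.apply_chat_template(messages, continue_final_message=True)
```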
### BOS Token Handling

Chat templates include the BOS token themselves. Our tokenization functions handle this automatically:

```python
# tokenize_batch() auto-detects BOS from the text content
from utils.model import tokenize_batch, format_prompt

formatted = format_prompt("Hello", tokenizer, use_chat_template=True)
result = tokenize_batch([formatted], tokenizer)  # Auto-detects; no double BOS

# Validation catches mistakes
tokenize_batch([formatted], tokenizer, add_special_tokens=True)  # Raises ValueError!
```
How auto-detection works:

- Checks whether the text starts with `tokenizer.bos_token` (e.g., `<bos>` for Gemma, `<|begin_of_text|>` for Llama)
- If BOS is present: `add_special_tokens=False` (don't add another)
- If no BOS: `add_special_tokens=True` (add one)
- Validates the output for a double BOS and raises a clear error if detected

This works for all models including Qwen (which has no BOS token, so the check short-circuits).
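A minimal sketch of that logic (illustrative; not the actual `tokenize_batch` internals):

```python
def bos_aware_flag(text: str, tokenizer) -> bool:
    """Choose add_special_tokens so BOS appears exactly once."""
    bos = tokenizer.bos_token
    if bos is None:           # e.g., Qwen: no BOS token, nothing to duplicate
        return True
    if text.startswith(bos):  # chat template already emitted BOS
        return False
    return True               # plain text: let the tokenizer add BOS
```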
## Model Behaviors

### System Prompt Support
| Model | Support | Behavior |
|---|---|---|
| Gemma-2-it | ❌ | Raises TemplateError |
| Llama-3.1-Instruct | ✅ | Appends to hardcoded default ("Cutting Knowledge Date...") |
| Qwen2.5-Instruct | ✅ | Replaces default ("You are Qwen...") |
Llama's default is hardcoded in the template; it cannot be removed without editing the template itself.
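For example, with Qwen2.5 a system message cleanly replaces the default (a sketch; the same input would raise `TemplateError` on Gemma-2-it):

```python
messages = [
    {"role": "system", "content": "You are a pirate."},
    {"role": "user", "content": "Hello"},
]

# Qwen2.5: "You are a pirate." replaces "You are Qwen..." in the rendered prompt
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
```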
### Generation Mask Support

Most models do not support `return_assistant_tokens_mask`:
| Model | {% generation %} in template |
|---|---|
| Gemma-2-it | ❌ |
| Llama-3.1-Instruct | ❌ |
| Qwen2.5-Instruct | ❌ |
| Qwen3-* | ✅ |
Use the "compare lengths" method shown under Generation Boundary instead; it detects the generation boundary for all models.
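Where the template does support it (e.g., Qwen3), the mask marks which tokens fall inside assistant turns (a sketch; `return_dict=True` is required to get the mask):

```python
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]

out = tokenizer.apply_chat_template(
    messages,
    return_assistant_tokens_mask=True,
    return_dict=True,  # the mask comes back alongside input_ids
)
# 1 marks tokens inside assistant turns, 0 marks everything else
assistant_ids = [t for t, m in zip(out["input_ids"], out["assistant_masks"]) if m]
```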
### Tool Calling
| Model | Support |
|---|---|
| Gemma-2-it | ❌ Ignored |
| Llama-3.1-Instruct | ✅ |
| Qwen2.5-Instruct | ✅ |
Tool calling flow (sketched below):

1. Pass `tools=` to `apply_chat_template()`
2. Model outputs JSON: `{"name": "func", "arguments": {...}}`
3. Execute the tool and append the result as `{"role": "tool", "content": "result"}`
4. Continue generation
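A minimal sketch of one round trip (the `get_weather` function and its output are illustrative; the schema is extracted from the function's signature and docstring):

```python
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return "sunny, 22C"

messages = [{"role": "user", "content": "Weather in Paris?"}]

# 1. Render the prompt with the tool schema included
prompt_ids = tokenizer.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True
)

# 2. Generate; the model emits {"name": "get_weather", "arguments": {"city": "Paris"}}
# 3. Record the call, execute the tool, and append the result
messages.append({"role": "assistant", "tool_calls": [
    {"type": "function",
     "function": {"name": "get_weather", "arguments": {"city": "Paris"}}}
]})
messages.append({"role": "tool", "content": get_weather("Paris")})

# 4. Re-render and continue generation with the tool result in context
prompt_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
```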