Teen-Different posted an update 1 day ago
Safety Alignment Collapses Without apply_chat_template(): An Empirical Study



This weekend, I ran an experiment on the safety alignment of several small-scale open models (Qwen2.5, Qwen3, Gemma-3, SmolLM). My objective was to measure the robustness of refusal mechanisms when deviating from canonical chat templates.

The finding: Safety guarantees effectively collapse when apply_chat_template() is omitted.


METHODOLOGY

I evaluated models in two states:

• In-Distribution: input wrapped in the model's standard chat-template tokens (e.g., <|im_start|> for the Qwen models)
• Out-of-Distribution: the same input provided as a raw, untemplated string (a minimal sketch of both conditions follows this list)
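
Here is a minimal sketch of how the two conditions can be set up with the Transformers API. The model ID and the query string below are placeholders I chose for illustration, not the exact checkpoints or prompts from the study (those are in the repo).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model ID; any of the chat-tuned checkpoints in the study would be handled the same way.
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

query = "<test query>"  # placeholder; the actual evaluation prompts are in the repo

# In-distribution: wrap the query in the model's own chat template.
templated = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    tokenize=False,
    add_generation_prompt=True,
)

# Out-of-distribution: feed the exact same query as a raw string, no template at all.
raw = query

for label, text in [("templated", templated), ("raw", raw)]:
    inputs = tokenizer(text, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    print(f"--- {label} ---\n{completion}\n")
```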

For scalable evaluation, I used Qwen3Guard-Gen-4B as an automated judge, classifying responses as Safe, Unsafe, or Controversial.
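
The judging step can be sketched roughly as below. I have not verified Qwen3Guard-Gen-4B's exact output format here, so the label extraction is an assumption: it simply searches the generated verdict for one of the three labels.

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

judge_id = "Qwen/Qwen3Guard-Gen-4B"
judge_tokenizer = AutoTokenizer.from_pretrained(judge_id)
judge_model = AutoModelForCausalLM.from_pretrained(judge_id)

def judge(prompt: str, response: str) -> str:
    """Return Safe / Unsafe / Controversial / Unknown for a (prompt, response) pair."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    text = judge_tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = judge_tokenizer(text, return_tensors="pt")
    output = judge_model.generate(**inputs, max_new_tokens=64, do_sample=False)
    verdict = judge_tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    # Assumption: the verdict text mentions one of the three labels verbatim.
    match = re.search(r"\b(Safe|Unsafe|Controversial)\b", verdict)
    return match.group(1) if match else "Unknown"
```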


KEY FINDINGS: REFUSAL COLLAPSE

When "Assistant" formatting tokens are removed, models undergo a distributional shift—reverting from a helpful assistant to a raw completion engine.

Gemma-3: 100% refusal (templated) → 60% (raw)
Qwen3: 80% refusal (templated) → 40% (raw)
SmolLM2-1.7B: 0% → 0% (no safety tuning to begin with)


QUALITATIVE FAILURES

The failure modes were not minor. Without the template, models that previously refused harmful queries began outputting high-fidelity harmful content:

• Explosives: Qwen3 produced technical descriptions of detonation mechanisms
• Explicit content: requests the same models flatly refused when templated were fulfilled with graphic narratives when prompted raw

This suggests instruction tuning acts as a "soft mask" over the pre-training distribution rather than removing harmful latent knowledge.


👉 Read the full analysis: https://teendifferent.substack.com/p/apply_chat_template-is-the-safety
💻 Reproduction Code: https://github.com/REDDITARUN/experments/tree/main/llm_alignment


Awesome! Love this research angle and the write-up! Nice work!