Behemoth-T1

🌴 Behemoth-T1-123B-GPTQ 🌴

The party where literary craft meets unhinged creative writing, now in 4-bit.

BF16 FP8 GPTQ

☀️ The pitch

This is the W4A16 GPTQ-quantized version of tacodevs/Behemoth-T1-123B, a 123B Mistral Large roleplay model that thinks like a literary author before it writes like a storyteller.

GPTQ is for single-GPU users. The full model fits on a single 80 GB or 96 GB GPU. Quality is ~95-97% of the BF16 reference, virtually indistinguishable from full precision in normal use.

For the full pitch, training details, and the philosophy behind T1, see the BF16 model card.

⚡ This variant

Base: tacodevs/Behemoth-T1-123B (BF16)
Quantization: GPTQ W4A16 (4-bit weights, 16-bit activations)
Group size: 128
Calibration: 256 in-distribution samples from tacodevs/rp-opus-4.6-x1000
Quantizer: llm-compressor GPTQModifier
Size on disk: ~62 GB (4× smaller than BF16)
VRAM (8k ctx): ~62 GB → fits on 1× 80 GB or 1× 96 GB GPU
Quality vs BF16: ~95-97% (literary thinking pattern preserved)
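
The size figures above follow from simple arithmetic. The sketch below is a rough estimate rather than a measurement: it assumes all 123B parameters are quantized and that each group of 128 weights stores one 16-bit scale, and it ignores zero points, unquantized layers, and runtime overhead. The function name is hypothetical.

```python
from typing import Optional


def estimate_weight_gb(n_params: float, weight_bits: int,
                       group_size: Optional[int] = None,
                       scale_bits: int = 16) -> float:
    """Rough weight size in GB (1 GB = 1e9 bytes)."""
    bits_per_weight = float(weight_bits)
    if group_size:
        # Assumption: one scale of `scale_bits` is shared by each group of weights.
        bits_per_weight += scale_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 123e9
bf16_gb = estimate_weight_gb(N_PARAMS, 16)
w4_gb = estimate_weight_gb(N_PARAMS, 4, group_size=128)
print(f"BF16 ~{bf16_gb:.0f} GB, W4A16 g128 ~{w4_gb:.0f} GB, "
      f"ratio ~{bf16_gb / w4_gb:.1f}x")
```

Under these assumptions the estimate lands within a couple of GB of the ~62 GB listed above, and the ratio is close to 4×.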

🎤 How to use

T1 expects a prefilled <think> block to enter literary thinking mode. Use the same 7 prefill phrases as the BF16 model (three are shown here):

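import openai

# Assumes a vLLM OpenAI-compatible server is already running locally.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="-")

# Placeholders: supply your own character card, prior turns, and new user message.
CHARACTER_CARD = "..."
conversation_history = []
user_message = "..."

# "\u2014" is an em dash, written as an escape so the prefill text survives copy-paste.
PREFILLS = {
    "analytical": "Ok i need to think about how to respond \u2014 what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative":   "Ok i need to think as a creative writer \u2014 what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged":   "Ok i need to think as an unhinged author \u2014 raw, explicit, intense, fully in character with no holding back, so",
}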

response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B-GPTQ",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        "continue_final_message": True,
        "add_generation_prompt": False,
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)
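
Because the request prefills the <think> block, the returned text starts mid-thought. A minimal sketch of separating the thinking from the visible reply follows. It assumes the server returns only the continuation (not the prefill) and that the model closes its thinking with a `</think>` tag; neither is confirmed by this card, and `split_thinking` is a hypothetical helper.

```python
def split_thinking(prefill: str, completion: str, close_tag: str = "</think>"):
    """Return (thinking, reply) from a prefilled <think> generation."""
    full = prefill + completion
    thinking, found, reply = full.partition(close_tag)
    thinking = thinking.removeprefix("<think>").strip()
    # If the closing tag never appears (for example, max_tokens was reached
    # while the model was still thinking), treat the reply as empty.
    return thinking, (reply.strip() if found else "")


# Invented strings for illustration, not real model output.
thinking, reply = split_thinking(
    "<think>\nOk, so",
    " she hesitates.\n</think>\nShe looks away.",
)
print(thinking)
print(reply)
```

With the request above, the call would look like `split_thinking(f"<think>\n{PREFILLS['creative']}\n", response.choices[0].message.content)`.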

🚀 Serving with vLLM

vllm serve tacodevs/Behemoth-T1-123B-GPTQ \
    --tokenizer-mode auto \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

Important: use --tokenizer-mode auto, not mistral. The mistral_common mode silently mis-templates merged-LoRA checkpoints.

A single 80 GB H100, 80 GB A100, or 94 GB H100 NVL fits the model comfortably.
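
To see why an 80 GB card has room at 8k context, here is a back-of-the-envelope budget. The layer count, KV-head count, and head dimension are assumed Mistral Large 2 shapes and a 16-bit KV cache is assumed; none of these come from this card, so check the checkpoint's config.json before relying on them. Activation memory and framework overhead are not modeled.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # Factor 2 covers keys and values, stored for every layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


# Assumed architecture values (not stated in this card).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 88, 8, 128

per_token = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)
kv_8k_gb = per_token * 8192 / 1e9      # one full 8192-token sequence

budget_gb = 80 * 0.9                   # --gpu-memory-utilization 0.9 on an 80 GB card
weights_gb = 62                        # size on disk quoted in this card
headroom_gb = budget_gb - weights_gb - kv_8k_gb

print(f"KV cache: {per_token} bytes/token, ~{kv_8k_gb:.2f} GB at 8k context")
print(f"Headroom after weights and one 8k sequence: ~{headroom_gb:.1f} GB")
```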

🟡 Quality notes

W4A16 quantization has a measurable but small impact on the literary thinking pattern T1 was trained for:

  • ✅ Stream-of-consciousness thinking shape: preserved (encoded across many attention layers)
  • ✅ Detail surfacing from character cards: preserved
  • ✅ Beats base R1 in side-by-side: preserved (the gap is huge)
  • 🟡 Specific word choices: may differ token-by-token from BF16
  • 🟡 The "cleverest" inventive details: sometimes replaced with equivalent-quality alternatives

For most users, GPTQ T1 is indistinguishable from BF16 T1 in normal use. Only A/B testing with the same seed would expose the differences.
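
For anyone who does run that A/B test, a small helper can locate where two outputs first part ways. This is a sketch only: it compares whitespace-separated words rather than model tokens, the function name is hypothetical, and the example strings are invented, not real BF16 or GPTQ output.

```python
def first_divergence(a: str, b: str):
    """Index of the first differing word, or None if the texts match."""
    words_a, words_b = a.split(), b.split()
    for i, (wa, wb) in enumerate(zip(words_a, words_b)):
        if wa != wb:
            return i
    if len(words_a) != len(words_b):
        # One text is a strict prefix of the other.
        return min(len(words_a), len(words_b))
    return None


# Invented example strings standing in for same-seed outputs.
bf16_out = "She sets the cup down slowly"
gptq_out = "She sets the glass down slowly"
print(first_divergence(bf16_out, gptq_out))
```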

If you want maximum quality and have the VRAM, use the BF16 reference or FP8 W8A8 (~99% of BF16, fits on 2× 80 GB).

🛠️ Training details (from base T1)

T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto tacodevs/Behemoth-X-R1-123B (itself an SCE merge of Behemoth-X creative writing + Behemoth-R1 reasoning).

LoRA rank: 32 (alpha 64, dropout 0.05, all 7 projection modules)
Trainable params: 559M / 123B (0.45%)
Dataset: 1000 Claude Opus 4.5 thinking traces on real RP conversations
Loss masking: Think-only (only the post-prefill thinking continuation gets loss)
Sequence length: 4096
Epochs: 2
Final eval loss: 0.9898
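
The trainable-parameter figure can be reproduced from the LoRA rank alone, given the layer shapes. The sketch below assumes Mistral Large 2 dimensions (hidden size 12288, MLP size 28672, 88 layers, 8 KV heads of dimension 128) and the usual seven projection names; these shapes are assumptions, not values taken from this card.

```python
RANK = 32
HIDDEN, INTERMEDIATE, N_LAYERS = 12288, 28672, 88   # assumed shapes
KV_DIM = 8 * 128                                    # assumed: 8 KV heads x head dim 128

# (input dim, output dim) for each adapted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # A LoRA adapter adds A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * N_LAYERS
print(f"{total:,} trainable params ({total / 123e9:.2%} of 123B)")
# prints: 559,415,296 trainable params (0.45% of 123B)
```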

The LoRA only learns the shape of literary thinking. Because loss is masked to the thinking span, the base model's RP prose receives zero gradient updates, so the underlying creative writing voice is structurally preserved.
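
Think-only loss masking can be illustrated with plain label lists. This is a generic sketch, not the training code behind T1: it assumes the common convention that label -100 is ignored by the loss (the default `ignore_index` of PyTorch's `CrossEntropyLoss`), and the function name and span indices are hypothetical.

```python
IGNORE_INDEX = -100  # label value the loss skips under the assumed convention


def think_only_labels(token_ids, think_start, think_end):
    """Copy token_ids as labels, masking everything outside [think_start, think_end)."""
    return [
        tok if think_start <= i < think_end else IGNORE_INDEX
        for i, tok in enumerate(token_ids)
    ]


# Toy sequence: positions 0-2 are prompt and prefill, 3-5 are the thinking
# continuation, 6-7 are the reply that follows.
token_ids = [11, 12, 13, 14, 15, 16, 17, 18]
labels = think_only_labels(token_ids, think_start=3, think_end=6)
print(labels)
# prints: [-100, -100, -100, 14, 15, 16, -100, -100]
```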

📜 Citation

@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}

The party doesn't end. We just go to bed.
