🌴 Behemoth-T1-123B-GPTQ 🌴

The party where literary craft meets unhinged creative writing — now in 4-bit.

☀️ The pitch

This is the W4A16 GPTQ-quantized version of tacodevs/Behemoth-T1-123B — a 123B Mistral Large roleplay model that thinks like a literary author before it writes like a storyteller.

GPTQ is for single-GPU users. The full model fits on a single 80 GB or 96 GB GPU. Quality is ~95-97% of the BF16 reference — virtually indistinguishable from full precision in normal use.

For the full pitch, training details, and the philosophy behind T1, see the BF16 model card.

⚡ This variant

| | Value |
|---|---|
| Base | tacodevs/Behemoth-T1-123B (BF16) |
| Quantization | GPTQ W4A16 (4-bit weights, 16-bit activations) |
| Group size | 128 |
| Calibration | 256 in-distribution samples from `tacodevs/rp-opus-4.6-x1000` |
| Quantizer | llm-compressor GPTQModifier |
| Size on disk | ~62 GB (4× smaller than BF16) |
| VRAM (8k ctx) | ~62 GB → fits on 1× 80 GB or 1× 96 GB GPU |
| Quality vs BF16 | ~95-97% (literary thinking pattern preserved) |
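
As a rough sanity check on the size figures above, the sketch below estimates storage from the bit width and group size. It assumes symmetric quantization with one 16-bit scale per group of 128 weights and ignores any layers left unquantized, so it is a back-of-envelope estimate and not how the checkpoint was actually measured. The helper names are hypothetical.

```python
def effective_bits_per_weight(weight_bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Bits stored per weight when each group of weights shares one scale."""
    return weight_bits + scale_bits / group_size


def estimated_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Convert a parameter count and a bits-per-weight figure to decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 123e9  # nominal parameter count, taken from the model name

w4 = effective_bits_per_weight(weight_bits=4, group_size=128)
print(f"bits per weight: {w4}")
print(f"W4A16 estimate:  {estimated_size_gb(NUM_PARAMS, w4):.1f} GB")
print(f"BF16 estimate:   {estimated_size_gb(NUM_PARAMS, 16):.1f} GB")
```

Under these assumptions the W4A16 estimate lands in the low 60s of GB, in the same range as the size listed in the table, and roughly a quarter of the BF16 estimate.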

🎤 How to use

T1 expects a prefilled <think> block to enter literary thinking mode. Use the same 7 prefill phrases as the BF16 model; three of them are shown here:

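import openai

# vLLM serves an OpenAI-compatible API; any non-empty key works unless the
# server was started with --api-key.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="-")

# Three of the seven prefill phrases; the full set is on the BF16 model card.
PREFILLS = {
    "analytical": "Ok i need to think about how to respond — what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative":   "Ok i need to think as a creative writer — what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged":   "Ok i need to think as an unhinged author — raw, explicit, intense, fully in character with no holding back, so",
}

# Placeholder inputs (hypothetical values): replace with your own card and chat state.
CHARACTER_CARD = "You are Mira, a sardonic tavern keeper."
conversation_history = []  # earlier {"role": ..., "content": ...} turns, oldest first
user_message = "I push open the tavern door and look around."

response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B-GPTQ",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        # Prefilled assistant turn: opens the <think> block with the chosen phrase.
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        # vLLM-specific: continue the prefilled turn instead of opening a new one.
        "continue_final_message": True,
        "add_generation_prompt": False,
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)

# The completion continues the prefilled thinking, then gives the reply.
print(response.choices[0].message.content)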
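
Because the assistant turn is prefilled with an open <think> block, the returned text will normally contain the rest of the thinking followed by the in-character reply. The sketch below separates the two, assuming the model closes its thinking with a `</think>` tag before the reply. `split_thinking` is a hypothetical helper, not part of any library, and the sample string is made up for illustration.

```python
def split_thinking(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split a completion into (thinking, reply) at the first closing think tag.

    If the tag is missing (for example, generation hit max_tokens mid-thought),
    the whole text is treated as thinking and the reply is empty.
    """
    thinking, tag, reply = text.partition(close_tag)
    if not tag:
        return text.strip(), ""
    return thinking.strip(), reply.strip()


# Made-up completion text, standing in for response.choices[0].message.content.
sample = "the tavern is loud, she would deflect first</think>\nMira laughs and slides the mug across."
thinking, reply = split_thinking(sample)
print(reply)
```

Returning an empty reply when the tag is missing makes truncated generations easy to detect, so the caller can retry with a larger `max_tokens`.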

🚀 Serving with vLLM

vllm serve tacodevs/Behemoth-T1-123B-GPTQ \
    --tokenizer-mode auto \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

Important: use --tokenizer-mode auto, not mistral — mistral_common mode silently mis-templates merged-LoRA checkpoints.

A single 80 GB H100, a single 80 GB A100, or a single 94 GB H100 NVL fits the model comfortably at this context length.
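
To see why an 80 GB card is enough at 8k context, the sketch below adds a KV-cache estimate to the ~62 GB of weights and compares the total with the memory vLLM may use at `--gpu-memory-utilization 0.9`. The layer count, KV-head count, and head dimension are assumed values for the Mistral Large architecture and should be checked against the checkpoint's `config.json`. Activations and runtime overhead are ignored, so treat the headroom as an upper bound.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens of context."""
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Assumed architecture values (not stated in this card).
NUM_LAYERS = 88
NUM_KV_HEADS = 8
HEAD_DIM = 128

# Values taken from this card and the serve command above.
WEIGHTS_GB = 62
GPU_GB = 80
GPU_MEMORY_UTILIZATION = 0.9
MAX_MODEL_LEN = 8192

kv_gb = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, MAX_MODEL_LEN) / 1e9
budget_gb = GPU_GB * GPU_MEMORY_UTILIZATION
headroom_gb = budget_gb - WEIGHTS_GB - kv_gb

print(f"KV cache for one {MAX_MODEL_LEN}-token sequence: {kv_gb:.2f} GB")
print(f"vLLM memory budget: {budget_gb:.1f} GB")
print(f"Headroom after weights and KV cache: {headroom_gb:.2f} GB")
```

Under these assumptions a full 8192-token sequence costs about 3 GB of cache, which leaves several GB of the budget free on an 80 GB card.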

🟡 Quality notes

W4A16 quantization has a measurable but small impact on the literary thinking pattern T1 was trained for:

✅ Stream-of-consciousness thinking shape — preserved (encoded across many attention layers)
✅ Detail surfacing from character cards — preserved
✅ Beats base R1 in side-by-side — preserved (the gap is huge)
🟡 Specific word choices — may differ token-by-token from BF16
🟡 The "cleverest" inventive details — sometimes replaced with equivalent-quality alternatives

For most users, GPTQ T1 is indistinguishable from BF16 T1 in normal use. Only A/B testing with the same seed would expose the differences.

If you want maximum quality and have the VRAM, use the BF16 reference or FP8 W8A8 (~99% of BF16, fits on 2×80 GB).

🛠️ Training details (from base T1)

T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto tacodevs/Behemoth-X-R1-123B (itself an SCE merge of Behemoth-X creative writing + Behemoth-R1 reasoning).


| | |
|---|---|
| LoRA rank | 32 (alpha 64, dropout 0.05, all 7 projection modules) |
| Trainable params | 559M / 123B (0.45%) |
| Dataset | 1000 Claude Opus 4.5 thinking traces on real RP conversations |
| Loss masking | Think-only (only the post-prefill thinking continuation gets loss) |
| Sequence length | 4096 |
| Epochs | 2 |
| Final eval loss | 0.9898 |
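
The trainable-parameter figure can be reproduced from the LoRA rank and the shapes of the seven projection modules. The sketch below assumes Mistral Large dimensions (hidden size 12288, MLP size 28672, 8 KV heads of dimension 128, 88 layers); none of these are stated in this card, so verify them against `config.json`. Each adapted module adds `rank * (in_features + out_features)` parameters.

```python
# Assumed architecture values (not stated in this card).
HIDDEN = 12288
INTERMEDIATE = 28672
KV_DIM = 8 * 128  # num_kv_heads * head_dim
NUM_LAYERS = 88
RANK = 32  # from the table above

# (in_features, out_features) for each of the seven adapted projections.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds an (in x rank) and a (rank x out) matrix."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
percent_of_base = total / 123e9 * 100

print(f"LoRA params per layer: {per_layer:,}")
print(f"LoRA params total:     {total:,}")
print(f"Share of 123B base:    {percent_of_base:.2f}%")
```

With these assumed shapes the total comes to roughly 559M parameters, about 0.45% of the base model, which agrees with the table.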

The LoRA only learns the shape of literary thinking. The base model's RP prose engine receives zero gradient updates — the underlying creative writing voice is structurally preserved.
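
Think-only loss masking is commonly implemented by setting the label of every token outside the target span to the loss function's ignore index. The sketch below illustrates that idea on a toy sequence. It is not the training code used for T1, and `think_only_labels` and the token positions are hypothetical. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def think_only_labels(input_ids: list[int], think_start: int, think_end: int) -> list[int]:
    """Copy input_ids into labels, masking every position outside [think_start, think_end)."""
    return [
        token if think_start <= position < think_end else IGNORE_INDEX
        for position, token in enumerate(input_ids)
    ]


# Toy sequence: positions 0-2 stand for the prompt and prefill, 3-5 for the
# thinking continuation, and 6-7 for whatever follows the thinking.
input_ids = [11, 12, 13, 14, 15, 16, 17, 18]
labels = think_only_labels(input_ids, think_start=3, think_end=6)
print(labels)  # [-100, -100, -100, 14, 15, 16, -100, -100]
```

Only the three unmasked positions contribute to the loss, so gradients come solely from predicting the thinking continuation.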

📜 Citation

@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}

The party doesn't end. We just go to bed.

Model tree for tacodevs/Behemoth-T1-123B-GPTQ

Base model: mistralai/Mistral-Large-Instruct-2411
Finetuned: TheDrummer/Behemoth-R1-123B-v2
Quantized (14): this model