Why is the API for GLM-5.1 more expensive than GLM-5 when the model size is the same?
Hi team and community,
I noticed that the API pricing for GLM-5.1 is higher than GLM-5 on the Z.ai platform:
- GLM-5.1: Input $1.40 / Output $4.40
- GLM-5: Input $1.00 / Output $3.20
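To make the gap concrete, here is a minimal sketch comparing what one request costs under each price list. It assumes the listed prices are USD per 1M tokens (the usual API convention; the post does not state the unit), and the example token counts are made up for illustration.

```python
# Assumed: prices are USD per 1M tokens (not stated explicitly in the post).
PRICES = {
    "GLM-5.1": {"input": 1.4, "output": 4.4},
    "GLM-5":   {"input": 1.0, "output": 3.2},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Billed cost in USD for a single request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 10k-token prompt with a 2k-token completion.
cost_51 = request_cost("GLM-5.1", 10_000, 2_000)  # 0.0228 USD
cost_50 = request_cost("GLM-5", 10_000, 2_000)    # 0.0164 USD
```

For this input/output mix, GLM-5.1 comes out roughly 39% more expensive per request; the exact ratio depends on how output-heavy your traffic is, since the output price gap (4.4 vs 3.2) is larger than the input gap.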
As far as I know, both models share the same architecture and parameter size (744B total, 40B active MoE).
So my question is: Why the price increase?
Is the inference efficiency worse due to defaults like Thinking Mode or agentic optimizations? Or is it purely a business decision (value-based pricing) because GLM-5.1 is highly optimized and much smarter via post-training?
What puzzles me most is that since GLM-5 and GLM-5.1 share the same architecture and parameter size, the inference cost (hardware requirement) should be identical. In an open-source ecosystem, anyone hosting the model would simply replace 5 with 5.1 at zero additional operational cost.
Therefore, choosing 5 over 5.1 just because it's 'cheaper' seems fundamentally irrational from a purely technical standpoint. Is this API pricing strictly a business strategy (value-based pricing to recover R&D costs), or is there an invisible technical overhead in 5.1 that I'm missing?
I'd love to hear the technical or strategic reasons behind this. Thanks!
Yes, I also wonder about this. GLM-5 shares the core DSA technology with DeepSeek-V3.2 and has a comparable size (744B-A40B vs 671B-A37B), yet it costs several times as much as the latter, so it might be purely commercial considerations. (As you can notice, almost all providers on OpenRouter match their prices to the official one.)
I suspect there might be (not sure) two reasons for this:
1) Chinese computation is much cheaper (due to an abundance of energy and subsidies), even though American chips are better, so American servers (like those on OpenRouter) easily get undercut by Chinese computation.
2) Data war: using point (1) as leverage, Chinese companies are aggressively selling their own API/Openclaw services (even at a loss); that's one of the reasons some Chinese models are going proprietary (like the glm-turbo series). So if you don't want to pay a premium, grab their coding plan 🤓.
What puzzles me most is that since GLM-5 and GLM-5.1 share the same architecture and parameter size, the inference cost (hardware requirement) should be identical.
This assumption is not correct, which might be where the confusion comes from. Here is a short explanation from ChatGPT:
Since GLM-5 and GLM-5.1 appear to be very similar MoE models with roughly the same active parameter count, their baseline per-token compute and minimum weight-memory requirements should be broadly similar. But their real inference cost is not guaranteed to be identical, because serving cost also depends on exact parameter count, routing behavior, attention implementation, context/output lengths, quantization, batching, cache behavior, inference framework, and any “thinking”/agentic usage patterns.
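One of those factors, "thinking" usage patterns, is easy to quantify in isolation. A hedged sketch with made-up numbers: even if the hardware cost per token were identical, a model that emits a hidden reasoning trace before its answer bills more output tokens per request, so the effective cost per useful answer rises.

```python
def effective_cost_per_answer(output_price_per_1m: float,
                              answer_tokens: int,
                              thinking_tokens: int) -> float:
    """Billed output cost (USD) for one answer, including any reasoning tokens."""
    billed_tokens = answer_tokens + thinking_tokens
    return billed_tokens * output_price_per_1m / 1_000_000

# Same 500-token visible answer, with and without a hypothetical
# 1500-token reasoning trace, at the same $3.20/1M output price.
plain    = effective_cost_per_answer(3.2, 500, 0)     # 0.0016 USD
thinking = effective_cost_per_answer(3.2, 500, 1500)  # 0.0064 USD
```

In this toy case the thinking-mode answer costs 4x as much despite identical per-token pricing, which is one way a provider's real serving cost (and hence its price) can diverge between two architecturally identical models.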
GLM-5.1 is straight-up a disaster. I genuinely believe they intentionally made it dumber alongside the new licensing plans; it chokes on basic, mediocre everyday tasks just to enforce the usage limits.
Not trying to sound like a whiny b*, but stuff that took it over an hour, Grok finished in one go. You'll probably think I'm exaggerating… until you try it yourself.
