Fix GLM 4.7 Lite MoE gating func
Sorry I'm not familiar with llama.cpp.
Link: https://github.com/ggml-org/llama.cpp/pull/18980
It seems the problem is caused by llama.cpp's incorrect implementation of GLM 4.7 Flash. A fix should be coming soon, but will changing the gating function require regenerating all the GGUF files?
We're updating now. They're still uploading, but it seems to perform much better now!
Thank you very much.
This discussion is now complete, so I will close it.
Can you let us know if you see any improvements? Thank you :)
Sorry for the delay. This is a big improvement over the last one, thanks! I'm still questioning the Glm4MoeLiteForCausalLM implementation, though: my tests show performance similar to Qwen3-30B-A3B-Thinking-2507. It might need more optimization, especially since Flash Attention isn't working yet.
Just a quick update:
With llama.cpp b7832, the update for GLM 4.7 Flash is a complete game-changer.
kv-cache: Now supports V-less cache (#19067), which cuts the VRAM usage for the KV cache in half.
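For intuition on the halving: a standard KV cache stores both a K and a V tensor per layer, while a V-less cache stores only K, so the cache footprint drops by exactly half. A rough back-of-envelope sketch (the model dimensions below are hypothetical placeholders, not GLM 4.7 Flash's actual config):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len,
                   bytes_per_elt=2, store_v=True):
    """Approximate KV cache size: one (or two) [ctx_len, n_kv_heads, head_dim]
    tensors per layer, in bytes (default fp16)."""
    tensors = 2 if store_v else 1  # K and V, or K only (V-less cache)
    return tensors * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elt

# Hypothetical dimensions for illustration only
full  = kv_cache_bytes(32, 8, 128, 32768)                # K + V
vless = kv_cache_bytes(32, 8, 128, 32768, store_v=False) # K only
print(full // 2**20, vless // 2**20)  # MiB: V-less is exactly half
```

Whatever the real layer/head counts are, dropping the V tensor removes one of the two equally sized halves, which is where the 50% VRAM saving comes from.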
Oh nice!