Fix GLM 4.7 Lite MoE gating func

#9
by BVEsun - opened

Sorry I'm not familiar with llama.cpp.

Link: https://github.com/ggml-org/llama.cpp/pull/18980

It seems the problem is caused by llama.cpp's incorrect implementation of GLM 4.7 Flash. A fix should be coming soon, but will changing the gating function require regenerating all the GGUF files?
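For context on what the gating function does: MoE models in this family typically score experts with a sigmoid (rather than a plain softmax), pick the top-k, renormalize the selected scores, and apply a routed scaling factor; getting any of those steps wrong silently misweights experts without crashing. A minimal NumPy sketch of that general shape, assuming sigmoid scoring and top-k normalization — parameter names are illustrative, not llama.cpp's actual identifiers:

```python
import numpy as np

def moe_gate(logits, top_k=8, routed_scaling_factor=1.0):
    # Hypothetical sketch of a sigmoid-based MoE gate; not llama.cpp's code.
    # Sigmoid scores per expert instead of a softmax over all experts.
    scores = 1.0 / (1.0 + np.exp(-logits))
    # Select the top-k experts by score (sigmoid is monotonic, so
    # ranking by scores matches ranking by logits).
    top_idx = np.argsort(scores)[::-1][:top_k]
    top_scores = scores[top_idx]
    # Renormalize the selected scores to sum to 1, then rescale.
    weights = top_scores / top_scores.sum() * routed_scaling_factor
    return top_idx, weights
```

Since the gate runs at inference time on the same tensors already in the GGUF, a fix like this usually only needs a llama.cpp rebuild, not requantization — unless the conversion script also changes.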

Unsloth AI org

We're updating now. They're still uploading, but it seems to perform much better now!

Unsloth AI org

@BVEsun all updated now!

Thank you very much.

This discussion is now complete, so I will close it.

BVEsun changed discussion status to closed
Unsloth AI org


Can you let us know if you see any improvements? Thank you :)


Sorry for the delay. This is a big improvement over the last one, thanks! I'm still questioning the Glm4MoeLiteForCausalLM implementation, though. My tests show performance similar to Qwen3-30B-A3B-Thinking-2507. It might need more optimization, especially since Flash Attention isn't working yet.

Just a quick update:
With llama.cpp b7832, the update for GLM 4.7 Flash is a complete game-changer.
KV cache: it now supports a V-less cache (#19067), which cuts the KV-cache VRAM usage in half.
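The halving follows directly from the arithmetic: a conventional cache stores both a K and a V tensor per layer, so dropping V removes exactly half the footprint. A back-of-envelope sketch — the layer/head/context numbers below are illustrative placeholders, not GLM 4.7 Flash's real config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len,
                   bytes_per_elem=2, store_v=True):
    # Rough KV-cache size estimate; placeholder dimensions, fp16 elements.
    per_tensor = n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    # A conventional cache holds K and V; a V-less cache holds only K.
    return per_tensor * (2 if store_v else 1)

full = kv_cache_bytes(32, 8, 128, 32768)                 # K + V
v_less = kv_cache_bytes(32, 8, 128, 32768, store_v=False)  # K only
# → full is exactly twice v_less
```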

Unsloth AI org

Oh nice!
