Fix GLM 4.7 Lite MoE gating func
Sorry I'm not familiar with llama.cpp.
Link: https://github.com/ggml-org/llama.cpp/pull/18980
It seems the problem is caused by llama.cpp's incorrect implementation of GLM 4.7 Flash. A fix should be coming soon, but will changing the gating function require regenerating all the GGUF files?
We're updating now. They're still uploading, but it seems to perform much better now!
Thank you very much.
This discussion is now complete, so I will close it.
Can you let us know if you see any improvements? Thank you :)
Sorry for the delay. This is a big improvement over the last one, thanks! I'm still questioning the Glm4MoeLiteForCausalLM implementation, though: my tests show performance similar to Qwen3-30B-A3B-Thinking-2507. It might need more optimization, especially since Flash Attention isn't working yet.
Just a quick update:
With llama.cpp b7832, the update for GLM 4.7 Flash is a complete game-changer.
kv-cache: Now supports V-less cache (#19067), which cuts the VRAM usage for the KV cache in half.
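For intuition on the halving: a standard KV cache stores both a K and a V tensor per layer, while a V-less cache stores only K, so the cache footprint drops by exactly half. A rough back-of-envelope sketch (the model dimensions below are hypothetical placeholders, not GLM 4.7 Flash's actual config):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len,
                   bytes_per_elt=2, store_v=True):
    """Approximate KV cache size: one (or two) [ctx_len, n_kv_heads, head_dim]
    tensors per layer, in bytes (default fp16)."""
    tensors = 2 if store_v else 1  # K and V, or K only (V-less cache)
    return tensors * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elt

# Hypothetical dimensions for illustration only
full  = kv_cache_bytes(32, 8, 128, 32768)                # K + V
vless = kv_cache_bytes(32, 8, 128, 32768, store_v=False) # K only
print(full // 2**20, vless // 2**20)  # MiB: V-less is exactly half
```

Whatever the real layer/head counts are, dropping the V tensor removes one of the two equally sized halves, which is where the 50% VRAM saving comes from.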
Oh nice!