I just tried with llama.cpp RTX4090 (24GB) GGUF unsloth quant UD_Q4_K_XL You can...

kpw94 · 2026-04-02T20:17:27 1775161047

> I'll need to investigate further but it doesn't seem promising.

That's what I meant by "waiting a few days for updates" in my other comment. Qwen 3.5 release, I remember a lot of complaints about: "tool calling isn't working properly" etc.

That was fixed shortly after: there was some template parsing work in llama.cpp. and unsloth pulled out some models and brought back better one for improving something else I can't quite remember, better done Quantization or something...

coder543 pointed out the same is happening regarding tool calling with gemma4: https://news.ycombinator.com/item?id=47619261

GistNoesis · 2026-04-02T20:34:36 1775162076

The model does call tools successfully giving sensible parameters but it seems to not picking the right ones in the right order.

I'll try in a few days. It's great to be able to test it already a few hours after the release. It's the bleeding edge as I had to pull the last from main. And with all the supply chain issues happening everywhere, bleeding edge is always more risky from a security point of view.

There is always also the possibility to fine-tune the model later to make sure it can complete the custom task correctly. But the code for doing some Lora for gemma4 is probably not yet available. The 50% extra speed seems really tempting.

amarshall · 2026-04-02T20:25:39 1775161539

If you are running on 4090 and get 5 t/s, then you exceeded your VRAM and are offloading to the CPU (or there is some other serious perf. issue)

mrinterweb · 2026-04-03T00:28:46 1775176126

Thank you. I have the same card, and I noticed the same ~100 TPS when I ran Q3.5-35B-A3B. G4 26B A4B running at 150TPS is a 50% performance gain. That's pretty huge.