|
|
|
|
|
by chisleu
287 days ago
|
|
I've been using GLM 4.5 and GLM 4.5 Air for a while now. The Air model is light enough to run on a macbook pro and is useful for Cline. I can run the full GLM model on my Mac Studio, but the TPS is so slow that it's only useful for chatting. So I hooked up with openrouter to try but didn't have the same success. Any of the open weight models I try with open router give sub standard results. I get better results from Qwen 3 coder 30b a3b locally than I get from Qwen 3 Coder 480b through open router. I'm really concerned that some of the providers are using quantized versions of the models so they can run more models per card and larger batches of inference. |
|
This doesn't match my experience precisely, but I've definitely had cases where some of the providers had consistently worse output for the same model than others, the solution there was to figure out which ones those are and to denylist them in the UI.
As for quantized versions, you can check it for each model and provider, for example: https://openrouter.ai/qwen/qwen3-coder/providers
You can see that these providers run FP4 versions:
And these providers run FP8 versions: I will say that it's not all bad and my experience with FP8 output has been pretty decent, especially when I need something done quickly and choose to use Cerebras - provided their service isn't overloaded, their TPS is really, really good.You can also request specific precision on a per request basis: https://openrouter.ai/docs/features/provider-routing#quantiz... (or just make a custom preset)