by avidphantasm
35 days ago
It's actually a bit faster than that now, it seems: about 112 tok/sec.

Configuration:
Gemma 4 31B Instruct Q6K
Context size 40960
LM Studio 0.4.13+1
Metal llama.cpp v2.14.0
LM Studio MLX (Apple M5) v1.6.0

Here are my results:

prompt eval time = 32545.36 ms / 5625 tokens ( 5.79 ms per token, 172.84 tokens per second)
eval time = 20227.99 ms / 310 tokens ( 65.25 ms per token, 15.33 tokens per second)
total time = 52773.35 ms / 5935 tokens

This was for interacting with a local MCP service, running a tool that returns a ~20KB text file to the agent to add to the chat context. I'm seeing about the same number of tokens/second on an M2 Ultra that I have access to (also with 128GB of memory).

This is surely an apples-to-oranges comparison with the OP's results (and I don't spend a great deal of time benchmarking these things, so my methodology might be lacking), but it's interesting to see okay performance from a top open model. For most uses, however, I find Gemma 4 26B A4B (Q6K) to be good enough (esp. for MCP calling) and much, much faster (~1,200 tokens/second).
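As a quick sanity check, the per-token and per-second rates can be recomputed from the raw times and token counts in the log lines. This is only a sketch: the `timings` dict and the `rates` helper are illustrative names (not part of LM Studio or llama.cpp), and the numbers are copied by hand from the output quoted here. Dividing total tokens by total time comes out at roughly 112 tokens/second, which appears to be where the headline figure comes from, so it blends prompt processing and generation rather than measuring generation alone.

```python
# Timing figures copied by hand from the log lines quoted in this comment.
# Each entry is (elapsed milliseconds, token count).
timings = {
    "prompt eval": (32545.36, 5625),
    "eval": (20227.99, 310),
    "total": (52773.35, 5935),
}


def rates(ms: float, tokens: int) -> tuple[float, float]:
    """Return (milliseconds per token, tokens per second)."""
    return ms / tokens, tokens / (ms / 1000.0)


for name, (ms, tokens) in timings.items():
    ms_per_tok, tok_per_s = rates(ms, tokens)
    print(f"{name:>11}: {ms_per_tok:6.2f} ms/token, {tok_per_s:7.2f} tokens/s")
```

The generation-only rate (the `eval` line) is the one that matters most for how fast a reply streams out; the blended figure mostly reflects the long ~5.6K-token prompt being processed quickly.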