by mickeyp
16 days ago
Impressive work. But the problem is not the 30 tok/s, which is fine for agentic coding and chat. It's prefill; slow prefill kills agentic workloads dead. If you have 100,000 tokens at ~150 tok/s per the OP, you're looking at:

  You have: 100000 / (150/s)
  You want: hms
  11 min + 6.6666667 sec

Which is quite a wait indeed.
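
For anyone who wants to plug in their own numbers, here's a rough sketch of the same arithmetic in Python. The function name and parameters are made up for illustration, and it assumes a constant prefill rate, which is a simplification: real throughput usually varies with context length.

```python
def prefill_wait(prompt_tokens: int, prefill_tok_per_s: float) -> tuple[int, float]:
    """Return (minutes, seconds) spent on prefill before the first output token.

    Assumes a constant prefill rate (a simplification, not a measured model).
    """
    total_seconds = prompt_tokens / prefill_tok_per_s
    minutes, seconds = divmod(total_seconds, 60)
    return int(minutes), seconds


# The case from above: 100,000 prompt tokens at ~150 tok/s.
minutes, seconds = prefill_wait(100_000, 150)
print(f"{minutes} min + {seconds:.1f} sec")  # 11 min + 6.7 sec
```

Note this is only the wait before generation starts; the 30 tok/s decode time comes on top of it.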
This is also a problem for every local LLM setup on a Mac. Macs are a great way to get a lot of high-bandwidth memory, but their compute is very far behind current-gen dedicated GPUs. Some of the expensive Mac Studio setups let you run very large models at usable tokens/s, but you can be waiting a long time for prefill to finish before any of those tokens are generated.