HN Mirror

	by perfmode 38 days ago
	How’s the token throughput / response time?

simonw 38 days ago

Healthy!

  prefill: 30.91 t/s, generation: 29.58 t/s

From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...

antirez 37 days ago

Prefill is 400 t/s on that hardware. It's just that with a very short prompt you can't see the real speed, because it defaults to single-token context processing.

simonw 37 days ago

Hah, that's my fault for just using "Generate an SVG of a pelican riding a bicycle" as my test prompt!

xienze 38 days ago

I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. A mere 10k tokens of context and you're sitting there for 5+ minutes before the first token is generated.
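
The arithmetic behind that estimate can be sketched as follows. This is a rough model that assumes prefill runs at a constant rate and ignores everything else that contributes to time to first token; the function name `prefill_seconds` is made up for illustration, and the rates are the figures quoted in this thread.

```python
def prefill_seconds(context_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt, assuming a constant prefill rate."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return context_tokens / prefill_tps


if __name__ == "__main__":
    context_tokens = 10_000  # the "10k" context from the comment above
    # Prefill rates (tokens/second) quoted by different commenters in this thread.
    for tps in (30.91, 121.76, 400.0):
        secs = prefill_seconds(context_tokens, tps)
        print(f"{tps:7.2f} t/s -> {secs:6.1f} s ({secs / 60:.1f} min)")
```

At roughly 31 t/s the wait comes out a bit over five minutes, which matches the "5+ minutes" claim; at 400 t/s the same context takes 25 seconds.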

fgfarben 38 days ago

That prefill number isn't right. M4 Max hits 200-300: https://github.com/antirez/ds4/blob/main/speed-bench/m4_max_...

hadlock 38 days ago

M5 studio is gonna sell like hot cakes

throwdbaaway 38 days ago

Hah, that's because the prompt itself was only about 30 tokens. We need a much bigger prompt to properly test PP (prompt processing).

aiscoming 38 days ago

If it's just the coding agent system prompt and tools, you can cache that.

xienze 38 days ago

Yeah the problem is that's just the start of the context. There's, you know, all the tool call results and file reads and stuff.
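
To make the caching trade-off concrete, here is a minimal sketch. It assumes a cached prefix costs nothing to reuse and that only the uncached remainder is prefilled at a constant rate. The helper name `uncached_prefill_seconds` and the 3,000-token prefix size are hypothetical, chosen only for illustration.

```python
def uncached_prefill_seconds(
    total_tokens: int, cached_prefix_tokens: int, prefill_tps: float
) -> float:
    """Seconds to prefill only the tokens that are not covered by a cached prefix."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    if not 0 <= cached_prefix_tokens <= total_tokens:
        raise ValueError("cached_prefix_tokens must be between 0 and total_tokens")
    return (total_tokens - cached_prefix_tokens) / prefill_tps


if __name__ == "__main__":
    total = 10_000   # context size used as an example in the thread
    prefix = 3_000   # hypothetical size of the cached system prompt and tool definitions
    rate = 30.91     # prefill rate (t/s) reported earlier in the thread

    full = uncached_prefill_seconds(total, 0, rate)
    cached = uncached_prefill_seconds(total, prefix, rate)
    print(f"no cache:      {full:6.1f} s")
    print(f"prefix cached: {cached:6.1f} s")
```

Under these assumptions the cache only removes the fixed prefix; tool call results and file reads appended later still have to be prefilled at the full per-token cost.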

embedding-shape 38 days ago

Comparison with an RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:

prefill: 121.76 t/s, generation: 47.85 t/s

The main target seems to be Apple's Metal, so that makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.

incidentist 37 days ago

Someone is working on a fork that is optimized for M5, might be worth a look: https://github.com/Swival/ds4-m5

rtpg 38 days ago

What are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?
