| HN Mirror



	by anoncareer0212 481 days ago
	Small point of order: "a bit slower" might not set expectations accurately. You noted in a previous post in the same thread[^1] that we'd expect about 1 minute per 10K tokens(!) of prompt processing time with the smaller model. I agree, and I contribute to llama.cpp. If anything, that is quite generous. [^1] https://news.ycombinator.com/item?id=43595888


terhechte 481 days ago

I don't think the time grows linearly. The more context, the slower it gets (at least in my experience, because the system has to throttle). I just tried 2k tokens in the same model that I used for the 120k test some weeks ago, and processing took 12 sec to first token (qwen 2.5 32b q8).


anoncareer0212 481 days ago

Hmmm, I might be rounding off wrong? Or reading it wrong?

IIUC the data we have:

2K tokens / 12 seconds ≈ 167 tokens/s prefill

120K tokens / (10 minutes == 600 seconds) = 200 tokens/s prefill
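
For anyone who wants to check the division, here is a rough sketch in Python. It assumes "K" means 1,000 tokens and that the whole time to first token is prefill; `prefill_rate` is just an illustrative helper name, not anything from llama.cpp.

```python
def prefill_rate(tokens: int, seconds: float) -> float:
    """Average prompt-processing throughput in tokens per second."""
    return tokens / seconds

# Figures quoted in this thread (assumption: K = 1,000 tokens).
short_run = prefill_rate(2_000, 12)         # 2K tokens, 12 s to first token
long_run = prefill_rate(120_000, 10 * 60)   # 120K tokens, about 10 minutes

print(f"2K run:   {short_run:.1f} tokens/s")   # 166.7
print(f"120K run: {long_run:.1f} tokens/s")    # 200.0
```

With only two data points this says nothing about how throughput behaves in between, only that the two averages are in the same ballpark.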


kgwgk 481 days ago

> The more context the slower

It seems the other way around?

120k : 2k = 600s : 10s
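
The same proportion as a small Python sketch, assuming purely linear scaling from the 120k run and the 12 s measurement reported upthread; `predicted_seconds` is a hypothetical helper for illustration only.

```python
def predicted_seconds(tokens: int, ref_tokens: int, ref_seconds: float) -> float:
    """Scale a reference timing linearly to a different prompt length."""
    return ref_seconds * tokens / ref_tokens

# Reference point: 120k tokens in roughly 600 s.
linear_guess = predicted_seconds(2_000, 120_000, 600)  # 10.0 s if time were linear
measured = 12                                          # seconds reported for 2k tokens

# Ratio above 1 means the short prompt was slower per token than the long one.
ratio = measured / linear_guess
print(f"linear guess: {linear_guess:.1f} s, measured: {measured} s, ratio: {ratio:.2f}")
```

So on these two numbers the short prompt came out about 20% slower per token than the long one, not faster.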
