by brucethemoose2 846 days ago
On 33B/34B models I get 35 tokens/sec, way faster than I can read as it streams in. At huge contexts (like 30K-74K tokens), prompt processing takes forever and token generation is slower, but it's still faster than I can read.

Miqu 70B is slow (less than 10 tok/sec, I think) because I have to split it with llama.cpp. I only use it for short-context questions where I need a bit more intelligence.

And for reference, this is an SFF desktop! It's no MacBook, but it's still small enough (10L and flat) for me to fly with in my carry-on.