by brucethemoose2
846 days ago
On 33B/34B models I get 35 tokens/sec, way faster than I can read as it streams in. At huge contexts (like 30K-74K), prompt processing takes forever and token generation is slower, but it's still faster than I can read. Miqu 70B is slow (less than 10 tokens/sec, I think) because I have to split it with llama.cpp. I only use it for short-context questions where I need a bit more intelligence. And for reference, this is an SFF desktop! It's no MacBook, but still small enough (10L and flat) for me to fly with in a carry-on.
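
As a rough sanity check on the "faster than I can read" point, here is a minimal sketch that converts a generation rate into words per minute. The 0.75 words-per-token ratio is a common rule of thumb for English text, not something measured on this setup, and the 300 words-per-minute reading speed is likewise an assumed ballpark. The function names are made up for illustration.

```python
# Back-of-the-envelope conversion from tokens/sec to words/min.
# ASSUMPTIONS (not from the original comment):
#   - ~0.75 English words per token (rule of thumb, varies by tokenizer)
#   - ~300 words/min as a fast-ish silent reading speed

WORDS_PER_TOKEN = 0.75      # assumed average, tokenizer-dependent
READING_SPEED_WPM = 300.0   # assumed reading speed


def tokens_per_sec_to_wpm(tokens_per_sec: float,
                          words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a generation rate in tokens/sec to approximate words/min."""
    return tokens_per_sec * words_per_token * 60.0


def outpaces_reader(tokens_per_sec: float,
                    reading_wpm: float = READING_SPEED_WPM) -> bool:
    """True if generation produces words faster than the assumed reader."""
    return tokens_per_sec_to_wpm(tokens_per_sec) > reading_wpm


if __name__ == "__main__":
    # 35 tok/s is the 33B/34B figure; 10 tok/s is the upper bound given for the 70B.
    for rate in (35.0, 10.0):
        wpm = tokens_per_sec_to_wpm(rate)
        print(f"{rate:.0f} tok/s ~ {wpm:.0f} words/min, "
              f"outpaces reader: {outpaces_reader(rate)}")
```

Under those assumptions, 35 tokens/sec is about 1575 words per minute and 10 tokens/sec is about 450, so both stay ahead of a typical reader.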