by petuman
310 days ago
> assuming I need 1k tokens/second throughput (on each, so 20 x 1k)

3.6B activated parameters at Q8 (about 1 byte per weight) x 1000 t/s = 3.6 TB/s just for the activated model weights (there's also context). So pretty much straight to B200 and the like. 1000 t/s per user/agent is way too fast; make it 300 t/s and you could get away with a 5090/RTX PRO 6000.
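
For anyone who wants to plug in their own numbers, here is a rough sketch of the same back-of-envelope estimate. It assumes Q8 is about 1 byte per weight (ignoring quantization scale overhead), that every generated token reads all activated weights once, and that there is no batching and no context/KV-cache traffic. The function name and parameters are made up for illustration.

```python
def weight_bandwidth_tb_per_s(active_params_billion, bytes_per_param, tokens_per_s):
    """Estimate memory bandwidth (TB/s) needed just to stream activated weights.

    Assumption: each generated token reads every activated weight exactly once,
    with no batching and no context/KV-cache traffic included.
    """
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bytes_per_token * tokens_per_s / 1e12


# 3.6B activated parameters at Q8 (assumed ~1 byte per weight)
fast = weight_bandwidth_tb_per_s(3.6, 1.0, 1000)
slow = weight_bandwidth_tb_per_s(3.6, 1.0, 300)

print(f"1000 t/s: {fast:.2f} TB/s")  # 1000 t/s: 3.60 TB/s
print(f" 300 t/s: {slow:.2f} TB/s")  #  300 t/s: 1.08 TB/s
```

This is only a lower bound per stream: context reads come on top of it, and real hardware rarely sustains its full rated bandwidth.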