by tyfon
123 days ago
I didn't really understand the performance table until I saw the top ones were 8B models. But 5 seconds/token is quite slow, yeah. I guess this is for low-RAM machines? I'm pretty sure my 5950X with 128 GB of RAM can run this faster on the CPU, with some layers / prefill on the 3060 GPU I have. I also see that they claim the process is compute bound at 2 seconds/token, but that doesn't seem correct with a 3090?

DDR4 tops out at about 27 GB/s.
DDR5 can do around 40 GB/s.
So for a 70B model at 8-bit quant (roughly 70 GB of weights that have to be read for every token), you will get around 0.3-0.5 tokens per second using RAM alone.
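
For anyone who wants to redo the arithmetic, here is a rough sketch of that estimate. It assumes decoding is purely memory-bandwidth bound and that every weight is read once per token; it ignores KV cache traffic, caches, and multi-channel setups. The function name `tokens_per_second` is just made up for illustration, and the bandwidth figures are the ones quoted above.

```python
def tokens_per_second(bandwidth_gb_s: float, params_billion: float, bits_per_weight: int) -> float:
    """Upper-bound decode speed if every weight is streamed from RAM once per token."""
    # Model size in GB: billions of parameters times bytes per parameter.
    model_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


# Bandwidth numbers taken from the comment above (assumed, not measured).
for name, bandwidth in [("DDR4", 27.0), ("DDR5", 40.0)]:
    rate = tokens_per_second(bandwidth, params_billion=70, bits_per_weight=8)
    print(f"{name}: {rate:.2f} tokens/s")
```

With those inputs it comes out to about 0.39 tokens/s for DDR4 and 0.57 for DDR5, so the same ballpark as the figure above.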