|
|
|
|
|
by logicchains
1150 days ago
|
|
That's unfortunate. Running the 65B Q4 on an AMD Epyc with 32 1.5ghz cores and 256 GB of ram I get around 3 tokens/sec, which is useable if not ideal. I wonder if the difference is related to the RAM or the number of CPUs? |
|
The latest Genoa has 12 channel DDR5-4800 support (and boosted AVX-512) and I'd imagine should perform quite well, but if you primarily want to run inference on a quantized 65B model, I think you're best bang/buck (for local hardware) would be 2 x RTX 3090s (each of those has 24GB of GDDR6X w/ just shy of 1TB/s of memory bandwidth).