by bick_nyers
382 days ago
There's still a lot of room for software optimization here. The trouble is that only two classes of systems really get optimizations for Deepseek: one small GPU plus a lot of RAM (ktransformers), and the system that has all the VRAM in the world. A system with, say, 192GB of VRAM and the rest in standard memory (DGX Station, 2x RTX Pro 6000, 4x B60 Dual, etc.) could still, in theory, run Deepseek at 4-bit quite quickly because expert usage follows a power-law-like distribution, so the frequently used experts could stay in VRAM while the rarely used ones sit in system RAM. If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate. This would be an easier job for pruning, but I still think enthusiast systems are going to trend over the next couple of years in a way that makes these kinds of software optimizations useful on a much larger scale. There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4 interconnect, which doesn't seem to be using its full bandwidth during tensor parallelism) that gets 7 token/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39 times per second, so something else must be limiting performance.
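For anyone wanting to check the "39 times per second" figure, here is a quick back-of-the-envelope sketch. It assumes the commonly quoted spec-sheet bandwidth of roughly 936 GB/s for the 3090 (that number is not from the comment above), and the function name is just illustrative.

```python
def full_scans_per_second(bandwidth_gb_s: float, vram_gb: float) -> float:
    """How many times per second a GPU could read its entire VRAM once."""
    return bandwidth_gb_s / vram_gb


# Assumed spec-sheet figure for the RTX 3090, not a measurement.
RTX_3090_BANDWIDTH_GB_S = 936.0
RTX_3090_VRAM_GB = 24.0

scans = full_scans_per_second(RTX_3090_BANDWIDTH_GB_S, RTX_3090_VRAM_GB)
observed_tokens_per_s = 7.0  # the llama.cpp figure reported for the 16x 3090 rig

print(f"full VRAM scans per second: {scans:.0f}")
print(f"gap vs. observed token rate: {scans / observed_tokens_per_s:.1f}x")
```

If every card streamed its own shard in parallel, that scan rate would be a rough ceiling on tokens per second even if every weight were read for every token. A mixture-of-experts model reads only the active experts, so the real ceiling should be higher still, which is why 7 token/s points at a bottleneck other than VRAM bandwidth.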
That's about 5 kW of power.
> that gets 7 token/s in llama.cpp
Just looking at the electricity bill, it's cheaper to use the API of any major provider.
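To put a rough number on that, here is a small sketch of the energy cost per million generated tokens. It takes the 5 kW and 7 token/s figures from this thread at face value and assumes a single generation stream with no batching. The electricity price is a hypothetical placeholder, so plug in your own rate.

```python
def kwh_per_million_tokens(power_kw: float, tokens_per_s: float) -> float:
    """Energy used to generate one million tokens at a constant power draw."""
    hours = 1_000_000 / tokens_per_s / 3600.0
    return power_kw * hours


POWER_KW = 5.0        # rough draw of the 16x 3090 rig, from the comment above
TOKENS_PER_S = 7.0    # reported llama.cpp throughput
PRICE_PER_KWH = 0.15  # hypothetical electricity price; substitute your own

energy_kwh = kwh_per_million_tokens(POWER_KW, TOKENS_PER_S)
cost = energy_kwh * PRICE_PER_KWH

print(f"{energy_kwh:.1f} kWh per million tokens")
print(f"about {cost:.2f} per million tokens at {PRICE_PER_KWH}/kWh")
```

That works out to just under 200 kWh per million tokens before counting hardware cost, which is the number to compare against whatever your API provider charges per million output tokens.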
> If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.
That's interesting. It means the model can be cut down, with any tokens that would have gone to a removed expert routed to the closest remaining expert, just in case they do come up.
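A minimal sketch of what that fallback could look like, assuming a generic top-k softmax router and reading "closest expert" as "the remaining expert with the next-highest router score". This is not Deepseek's actual gating code, and `route_top_k` and `kept_experts` are hypothetical names.

```python
import math


def route_top_k(router_logits, kept_experts, k):
    """Pick the top-k experts for one token, considering only unpruned experts.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    candidates = [
        (logit, idx)
        for idx, logit in enumerate(router_logits)
        if idx in kept_experts
    ]
    if not candidates:
        raise ValueError("no experts left to route to")

    # Highest-scoring surviving experts first.
    top = sorted(candidates, reverse=True)[:k]

    # Renormalize with a softmax over the selected logits only.
    max_logit = top[0][0]
    exps = [math.exp(logit - max_logit) for logit, _ in top]
    total = sum(exps)
    return [(idx, e / total) for (_, idx), e in zip(top, exps)]


# Expert 0 scores highest for this token but has been pruned,
# so the token falls through to experts 2 and 1.
logits = [2.0, 0.5, 1.5, -1.0]
routing = route_top_k(logits, kept_experts={1, 2, 3}, k=2)
print(routing)
```

Whether the next-best expert produces acceptable output for those rare tokens is an empirical question that this sketch does not answer.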