
DeepSeek bypasses CUDA and has a few other optimisations that neither llama.cpp nor vLLM supports.

Furthermore, V4 Pro was designed to run on four Huawei Ascend GPUs, which are much cheaper than the Nvidia setups other providers use, and DeepSeek probably also got some free hardware as part of their collaboration.

Hence it is entirely possible that their inference costs are significantly lower than those of other providers.
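As a rough back-of-envelope sketch (my own framing, not anything DeepSeek has published): serving cost per token is roughly the hourly cost of the hardware divided by the tokens it produces per hour. Cheaper accelerators lower the numerator, and serving optimisations raise the denominator, so the two effects multiply. The function name and inputs below are hypothetical placeholders, not real figures for any provider.

```python
def cost_per_million_tokens(hourly_hw_cost_usd: float, tokens_per_second: float) -> float:
    """Amortized serving cost in USD per million generated tokens.

    hourly_hw_cost_usd: hypothetical all-in cost of one serving node per hour
    tokens_per_second: hypothetical aggregate throughput of that node
    """
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_hw_cost_usd / tokens_per_hour * 1_000_000


# Placeholder numbers only: halving the hardware cost halves the per-token cost,
# and doubling throughput on top of that halves it again.
baseline = cost_per_million_tokens(36.0, 10_000)
cheaper_hw = cost_per_million_tokens(18.0, 10_000)
cheaper_and_faster = cost_per_million_tokens(18.0, 20_000)
print(baseline, cheaper_hw, cheaper_and_faster)
```

The point is just that a provider with both cheaper hardware and a faster inference stack can plausibly land at a fraction of everyone else's cost without any subsidy.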