Over the last few days people have asked me whether I think NVIDIA is fkd. It still takes two H100s to run inference on DS v3 671B at <200 tokens per second.
There are different versions of the model, and each can be run at different levels of quantization.
Some variants of DeepSeek-R1 can be run on 2x H100 GPUs, and some people have managed to get quite decent results running an even more heavily distilled model on consumer hardware.
For DeepSeek-V3, even with 4-bit quantization, you need more like 16x H100.
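
For anyone who wants to sanity-check GPU counts like these, here is a rough back-of-the-envelope sketch. It only estimates the memory taken by the weights, assuming 671B parameters (from the model name) and 80 GB per H100 (my assumption, the common variant). The function names are made up for illustration. It ignores KV cache, activations, and framework overhead, so treat the result as a floor, not a deployment recommendation.

```python
import math

# Assumption: the 80 GB H100 variant. Change this for other cards.
H100_MEMORY_GB = 80


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def min_gpus_for_weights(
    num_params: float,
    bits_per_param: float,
    gpu_memory_gb: float = H100_MEMORY_GB,
) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(num_params, bits_per_param) / gpu_memory_gb)


if __name__ == "__main__":
    params = 671e9  # DeepSeek-V3 total parameter count, per the model name
    for label, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
        gb = weight_memory_gb(params, bits)
        gpus = min_gpus_for_weights(params, bits)
        print(f"{label}: ~{gb:.0f} GB of weights, at least {gpus} GPUs")
```

Under these assumptions the 4-bit weights alone come to roughly 335 GB, which already rules out two cards. Real deployments need noticeably more than this floor, because of KV cache and activation memory and because H100s usually come in nodes of 8.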