by DeathArrow 33 days ago
>The blog post implies that it currently requires 96GB of VRAM.

From the GitHub page, it seems it only supports Apple and DGX Spark. I have 128 GB of RAM and a 3090, but it probably won't work.


FYI, llama.cpp (which antirez/ds4 is inspired by) supports running from system RAM. E.g. [1] is a good guide for running a similar-sized model with 128 GB of RAM and a 3090-sized GPU.

[1] https://unsloth.ai/docs/models/tutorials/minimax-m27

(Unsloth's deepseek-v4 support is still WIP)

Thanks, I can run Qwen 3.6 27B with vLLM, but I was curious about antirez's tool.
Have you seen it get stuck in endless loops in roughly 10-20% of invocations? It happens with both the Responses and Chat Completions APIs, and whatever inference parameters I try, at least 1 in 10 requests is affected. I've tried every compatible vLLM version and am currently running it from git (main), yet the issue persists.

It seems to happen with various quantizations too, including the NVFP4 versions, so it looks like a deeper issue to me, or perhaps a hardware incompatibility.

There’s a fixed version out there with corrected templates.
It wouldn’t be useful with your setup; you'd probably get 3-4 tokens per second.
Yep, maybe I can open a feature request if it makes sense technically.
Arguably it makes more sense technically to get the model support into llama.cpp, which provides many options for GPU+CPU split inference already.
I have an AMD 3995WX and 128 GB of DDR4-3200. I can load the Q2 and, using -t 64, get around 4 t/s out of the box. Haven't tried any other configs yet.

I do not think it can use multi-GPU or GPU/CPU offloading at this time.

That sounds memory-bandwidth limited. Does the total t/s decode throughput improve when you run multiple sessions in parallel?

(Note, that's total, not per-session. Tok/s figures per session will initially tank, since you're using the same total memory bandwidth to load incrementally more active params.)
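To make that concrete, here is a rough back-of-the-envelope sketch in Python. It assumes decode is purely memory-bandwidth bound, that the model is a mixture-of-experts, and that each session picks its experts uniformly and independently (real routing is not uniform). Every number and name in it is a hypothetical placeholder, not a measurement of ds4 or of any specific model.

```python
def expected_distinct_experts(num_experts, experts_per_token, sessions):
    """Expected number of distinct experts touched in one decode step when
    each session independently picks `experts_per_token` experts at random
    (a simplifying assumption, not how real routers behave)."""
    p_unused = (1 - experts_per_token / num_experts) ** sessions
    return num_experts * (1 - p_unused)


def decode_estimate(bandwidth_gb_s, shared_gb, expert_gb,
                    num_experts, experts_per_token, sessions):
    """Return (total tok/s, per-session tok/s) for a bandwidth-bound decode.

    shared_gb: weights read on every step regardless of routing (hypothetical).
    expert_gb: weights of one expert summed over all layers (hypothetical).
    """
    touched = expected_distinct_experts(num_experts, experts_per_token, sessions)
    gb_per_step = shared_gb + expert_gb * touched
    steps_per_s = bandwidth_gb_s / gb_per_step
    # Each step emits one token per session.
    return steps_per_s * sessions, steps_per_s


if __name__ == "__main__":
    # Placeholder figures chosen only to illustrate the trend.
    for n in (1, 2, 4, 8, 16):
        total, per_session = decode_estimate(
            bandwidth_gb_s=200.0, shared_gb=2.0, expert_gb=0.1,
            num_experts=64, experts_per_token=4, sessions=n)
        print(f"{n:2d} sessions: total {total:6.1f} tok/s, "
              f"per session {per_session:5.1f} tok/s")
```

Under these assumptions the total goes up with more sessions while the per-session figure drops, and the per-session drop flattens once most experts are being read on every step anyway.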