| HN Mirror



	by DeathArrow 79 days ago
	>The blog post implies that it currently requires 96 GB of VRAM. From the GitHub page, it seems to support only Apple hardware and DGX Spark. I have 128 GB of RAM and a 3090, but it probably won't work.

3 comments

thomasm6m6 79 days ago

FYI, llama.cpp (which antirez/ds4 is inspired by) supports system RAM. E.g. [1] is a good guide for running a similar-sized model with 128 GB of RAM and a 3090-sized GPU.

[1] https://unsloth.ai/docs/models/tutorials/minimax-m27

(Unsloth's DeepSeek-V4 support is still WIP)


DeathArrow 78 days ago

Thanks, I can run Qwen 3.6 27B with vLLM, but I was curious about antirez's tool.


embedding-shape 78 days ago

Have you seen it get stuck in endless loops in roughly 10-20% of invocations? It seems to happen with both the Responses and Chat Completions APIs, and no matter what inference parameters I try, it happens for at least 1 in 10 requests. I've tried every compatible vLLM version and am currently running it from git (main), yet the issue persists.

It seems to happen with various quantizations too, including the NVFP4 versions, so it looks like a deeper issue to me, or perhaps a hardware incompatibility.
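
For anyone who wants to put a number on how often the looping happens, here is a minimal sketch of a repetition detector. It assumes each response is available as a list of tokens (or whitespace-split words). The names `has_repeating_tail` and `loop_rate` and the default thresholds are hypothetical, picked for illustration; none of this is part of vLLM.

```python
def has_repeating_tail(tokens, min_period=1, max_period=50, min_repeats=4):
    """Return True if the sequence ends with one block repeated min_repeats times.

    A "block" is any run of `period` tokens; we try every period from
    min_period to max_period and check whether the tail is that block
    repeated back to back.
    """
    n = len(tokens)
    for period in range(min_period, max_period + 1):
        span = period * min_repeats
        if span > n:
            break  # longer periods need even more tokens, so stop here
        tail = tokens[n - span:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(span)):
            return True
    return False


def loop_rate(outputs, **kwargs):
    """Fraction of outputs (each a list of tokens) flagged as looping."""
    if not outputs:
        return 0.0
    flagged = sum(has_repeating_tail(o, **kwargs) for o in outputs)
    return flagged / len(outputs)
```

Running `loop_rate` over a batch of saved responses gives a rough failure rate to compare across vLLM versions, quantizations, or chat templates. The thresholds are a judgment call: a low `min_repeats` will flag legitimate repetition such as tables or lists.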


modmans2nd 64 days ago

There’s a fixed version out there with corrected templates.


manmal 78 days ago

It wouldn’t be useful with your setup, probably 3-4 tokens per second.


DeathArrow 78 days ago

Yep, maybe I can open a feature request if it makes sense technically.


zozbot234 78 days ago

Arguably it makes more sense technically to get the model support into llama.cpp, which provides many options for GPU+CPU split inference already.


hellifino 77 days ago

I have an AMD 3995WX and 128 GB of DDR4-3200. I can load the Q2 and, using -t 64, get around 4 t/s out of the box. Haven't tried any other configs yet.

I do not think it can use multi-GPU or GPU/CPU offloading at this time.


zozbot234 77 days ago

That sounds memory bandwidth limited. Does the total t/s decode throughput improve by running multiple sessions in parallel?

(Note, that's total not per-session. Tok/s figures per session will initially tank since you're using the same total mem bandwidth to load incrementally more active params.)
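
The bandwidth argument above can be sketched as a back-of-the-envelope estimate. If decode is dominated by streaming weights from RAM, each step has to read the active parameters once, so the step rate is capped at roughly bandwidth divided by bytes read per step. With several sessions batched into one step, the bytes read are the union of the parameters those sessions activate, and each step yields one token per session. The function names and the figures in the usage note are illustrative assumptions, not measurements of ds4 or of the machine described above; the model ignores compute, KV-cache reads, and cache effects.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billions, bits_per_weight):
    """Upper bound on single-session decode speed when weight streaming dominates."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


def step_throughput(bandwidth_gb_s, union_active_params_billions,
                    bits_per_weight, sessions):
    """Return (total tok/s, per-session tok/s) for one batched decode step.

    union_active_params_billions is the number of distinct parameters
    touched by all sessions in the step. For a mixture-of-experts model it
    grows with the session count, because different sessions route to
    different experts.
    """
    bytes_per_step = union_active_params_billions * 1e9 * bits_per_weight / 8
    steps_per_s = bandwidth_gb_s * 1e9 / bytes_per_step
    return steps_per_s * sessions, steps_per_s
```

For example, `step_throughput(100, 20, 8, 4)` models four sessions that together touch 20B parameters at 8 bits per weight on a 100 GB/s memory system (all made-up numbers). Per-session speed drops as the union grows, while the total rises as long as the union grows more slowly than the session count.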
