by gcr 35 days ago
DwarfStar4 is a small LLM inference runtime that can run DeepSeek 4. The blog post implies that it currently requires 96GB of VRAM.

For others who are lacking context :-)

6 comments

Thanks. Outside of LLM circles, DS4 is usually a video game controller.
Well, I was sitting here expecting the Redis creator to have an opinion on the still-unannounced Dark Souls 4.
Haha the same here!!
Or a car from Citroen
Technically DS is an independent sibling of Citroën within Stellantis, a sprawling car conglomerate that owns a dog’s dinner of car brands in Europe and USA.
If we want to get really technical, “DS4” was a model from Citroën, and they later spun out the DS lineup into its own separate brand, with the “Citroën DS4” becoming “DS 4”, “DS” being the make and “4” being the model.
And even more pedantically, DS has recently adopted a new naming scheme where the former DS 4 is now written as DS N°4, pronounced "number 4"...

Their stated inspiration for this SEO bomb is Chanel perfumes.

It's still the Lexus to Citroen's Toyota.
Pavlov's dog's dinner?
Trekkies are experiencing a major regression from Deep Space Nine.
There were prototypes. The Cardassians never get it right the first (eight) times.
Deep Space 4 vanished and was never seen again.
They never should have trusted Quark.
I am actually kind of disappointed it wasn't a deep dive on the DualShock 4.
That's the Flash version, not the full model, and only at roughly Q2-Q3, so while impressive it's still quite different from the full model.
Not really. I'm now building another fast C compiler with DeepSeek 4 Flash, and rarely have to step outside to use Pro or Sonnet, gpt or kimi-2.6. Flash is very capable of almost everything.
which harness are you using? pi? opencode?
That's not a harness. That's an agent cli. A harness is something completely different. Wish people could use proper terminology.

A test harness is a collection of software and test data configured to test a program unit by running it under varying conditions and monitoring its behavior and outputs. It automates the execution of test suites, providing the necessary stubs, drivers, and runtime environments so developers can isolate and verify specific code components.
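
To make that definition concrete, here is a minimal sketch of a test harness in Python. Everything in it is hypothetical and made up for illustration (the `price_with_tax` unit, the stubbed tax lookup, the test data); it only shows the three pieces the definition names: a stub, a driver, and test data.

```python
from typing import Callable, List, Tuple


def price_with_tax(net: float, tax_rate_lookup: Callable[[str], float], region: str) -> float:
    """Unit under test (hypothetical): depends on an external tax-rate lookup."""
    return round(net * (1.0 + tax_rate_lookup(region)), 2)


def stub_tax_rate(region: str) -> float:
    """Stub: replaces the real lookup with canned data so the unit is isolated."""
    return {"EU": 0.20, "US": 0.0}[region]


def run_harness(cases: List[Tuple[float, str, float]]) -> List[Tuple[float, str, bool]]:
    """Driver: runs the unit under each condition and records pass/fail."""
    results = []
    for net, region, expected in cases:
        actual = price_with_tax(net, stub_tax_rate, region)
        results.append((net, region, actual == expected))
    return results


# Test data: (net price, region, expected gross price).
CASES = [(100.0, "EU", 120.0), (50.0, "US", 50.0)]

for net, region, ok in run_harness(CASES):
    print(f"{net} in {region}: {'PASS' if ok else 'FAIL'}")
```

In practice a framework such as `unittest` or `pytest` plays the driver role; the hand-rolled loop here is only to keep the parts visible.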

I use opencode (lockedcode is still vaporware), claude, kimi and codex.

And most models. Just no Google models so far, I don't trust them.

It is really telling when people say that. It’s clear they think the job of harness development is done by selecting the agent environment.
Akshually, they said "harness," and not "test harness."

There's no particular reason "agent harness" can't have practically the same definition, substituting test-specific concepts for agent-specific ones.

Harness: a piece of equipment with straps and belts, used to control or hold in place a person, animal, or object.

So yes, the general meaning applies to test setup and running, and also to the agent cli, which is the harness for the model.

No, an agent cli is no harness. You have to provide a harness for an agent by yourself, otherwise it will run free. Which is called vibe coding. Free as you wish, without any harness.
May I ask about your trust issue regarding Google models?

Is it about quality issues (lack of guardrails, agent runs dangerous commands)? I have seen first-hand Gemini-cli going out of the project directory and using my home directory as a work area.

Or is it about terms of service?

Or other concerns?

Quality. They are too dumb.

And the lack of ease of use.

You're free to fight the terminology if you want (I did at first too), but the zeitgeist has chosen a meaning that disagrees with you, so people will see you as being deliberately obtuse and unpleasant when you fight back.

Learning when to let go is an incredibly important skill that I have learned way too late in life.

>The blog post implies that it currently requires 96GB of VRAM.

From the GitHub page it seems it only supports Apple and DGX Spark. I have 128 GB of RAM and a 3090, but it probably won't work.

FYI, llama.cpp (which antirez/ds4 is inspired by) supports system RAM. E.g. [1] is a good guide for running a similar-sized model with 128 GB of RAM and a 3090-sized GPU.

[1] https://unsloth.ai/docs/models/tutorials/minimax-m27

(Unsloth's deepseek-v4 support is still WIP)

Thanks, I can run Qwen 3.6 27B with vllm, but I was curious about antirez's tool.
Have you had it get stuck in endless loops in maybe ~10-20% of invocations? It seems to happen with both the responses and chat completions APIs, and no matter what inference parameters I try, it happens for at least 1 in 10 requests. I've tried every compatible vLLM version and am currently using it from git (#main), yet the issue persists.

It seems to happen with various quantizations too, including the NVFP4 versions, so it looks like a deeper issue to me, or perhaps a hardware incompatibility.

There’s a fixed version out there with corrected templates.
It wouldn’t be useful with your setup, probably 3-4 tokens per second.
Yep, maybe I can open a feature request if it makes sense technically.
Arguably it makes more sense technically to get the model support into llama.cpp, which provides many options for GPU+CPU split inference already.
I have an AMD 3995WX and 128GB of DDR4-3200. I can load the Q2 and, using -t 64, get around 4 t/s out of the box. Haven't tried any other configs yet.

I do not think it can use multi-gpu or gpu/cpu offloading at this time.

That sounds memory bandwidth limited. Does the total t/s decode throughput improve by running multiple sessions in parallel?

(Note, that's total not per-session. Tok/s figures per session will initially tank since you're using the same total mem bandwidth to load incrementally more active params.)
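
A rough sketch of that reasoning, assuming decode is limited only by memory bandwidth and that routing is uniform and independent across sessions. Every number below (bandwidth, weight sizes, expert counts) is a hypothetical placeholder, not a measurement of DeepSeek V4 Flash or of any particular machine, and the function names are made up for illustration.

```python
def step_bytes(sessions, dense_bytes, routed_bytes_total, n_experts, top_k):
    """Expected bytes read from memory for one decode step across all sessions.

    Assumes uniform, independent routing: each session picks top_k of
    n_experts, so the expected fraction of experts touched by the batch is
    1 - (1 - top_k / n_experts) ** sessions.
    """
    touched = 1.0 - (1.0 - top_k / n_experts) ** sessions
    return dense_bytes + routed_bytes_total * touched


def decode_throughput(sessions, bandwidth, **model):
    """Return (total tok/s, per-session tok/s) if bandwidth is the only limit."""
    steps_per_s = bandwidth / step_bytes(sessions, **model)
    return steps_per_s * sessions, steps_per_s


# Hypothetical placeholders, not measurements of any real model or machine.
MODEL = dict(dense_bytes=1e9, routed_bytes_total=70e9, n_experts=256, top_k=8)
BANDWIDTH = 100e9  # bytes/s

for n in (1, 2, 4, 8):
    total, per_session = decode_throughput(n, BANDWIDTH, **MODEL)
    print(f"{n} sessions: total {total:.1f} tok/s, per session {per_session:.1f} tok/s")
```

Under these assumptions the total goes up with more sessions while the per-session figure drops, which is the effect described above. The model ignores compute limits and KV-cache reads, so it is only a first-order estimate.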

> The blog post implies that it currently requires 96GB of VRAM.

Has anyone tested what happens if you try and run this on lower-RAM Macs? It might work and just be a bit slower as it falls back on fetching model layers from storage.

It'd be way slower since you'd be doing that work every token
True (with 64GB RAM it'd already have to fetch 20% of its active experts from disk, about 650MB/tok at 2-bit quant, and that percentage rises quickly as you lower RAM further). My question is just a more practical one: whether it runs at all, how bad the slowdown is, and to what extent you might be able to get some of that decode throughput back by running multiple (slower) agent sessions in parallel under a single DwarfStar4 server.
Storage is multiple orders of magnitude slower than RAM. Pretty sure it'd be more like 10s/tok than anything reasonable.
Active params for this model are 13B, which takes about 6.5GB at full native quantization, or perhaps 3.25GB at the 2-bit quant that's being provided here. That should take significantly less than 10s to fetch on Mac storage, especially given that some fraction of the model weights would be cached in RAM. Sounds like something worth testing out if it can be made to work out of the box with DS4.
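
The arithmetic in this subthread can be checked with a short sketch. It uses only the figures quoted above (13B active parameters, 6.5GB native which implies 4 bits per weight, a 2-bit quant, 20% of active experts missing from RAM). The SSD read bandwidth is a hypothetical placeholder, not a measured Mac figure, and the function names are made up for illustration.

```python
def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Bytes of weights touched to decode one token."""
    return active_params * bits_per_weight / 8


def disk_seconds_per_token(active_params: float, bits_per_weight: float,
                           miss_fraction: float, disk_bytes_per_s: float) -> float:
    """Time spent reading the non-cached share of the active weights from disk."""
    missed = bytes_per_token(active_params, bits_per_weight) * miss_fraction
    return missed / disk_bytes_per_s


ACTIVE_PARAMS = 13e9   # from the thread
ASSUMED_SSD_BW = 3e9   # bytes/s, hypothetical sequential read speed

native_bytes = bytes_per_token(ACTIVE_PARAMS, 4)   # 6.5e9, the "6.5GB" figure
q2_bytes = bytes_per_token(ACTIVE_PARAMS, 2)       # 3.25e9, the "3.25GB" figure
missed_bytes = q2_bytes * 0.20                     # about 650e6, the "650MB/tok" figure

print(f"native: {native_bytes / 1e9:.2f} GB/tok, 2-bit: {q2_bytes / 1e9:.2f} GB/tok")
print(f"from disk at 20% miss: {missed_bytes / 1e6:.0f} MB/tok")
print(f"disk time: {disk_seconds_per_token(ACTIVE_PARAMS, 2, 0.20, ASSUMED_SSD_BW):.2f} s/tok")
```

This treats reads as sequential at full bandwidth; scattered expert reads would likely be slower, so the result is a lower bound on per-token disk time rather than a prediction.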
Thanks. How is DwarfStar4 different from llama.cpp?
llama.cpp is general purpose in the sense that it supports many different model architectures. ds4 is laser-focused on DeepSeek V4 Flash, so it has a leaner codebase.
I knew Death Stranding 3 wasn't out yet!