	by gcr 35 days ago
	DwarfStar4 is a small LLM inference runtime that can run DeepSeek 4. The blog post implies that it currently requires 96GB of VRAM. For others who are lacking context :-)

6 comments

foresto 35 days ago

Thanks. Outside of LLM circles, DS4 is usually a video game controller.

artyom 35 days ago

Well, I was sitting here expecting the Redis creator to have an opinion on the still-unannounced Dark Souls 4.

low_tech_love 34 days ago

Haha the same here!!

oezi 35 days ago

Or a car from Citroen

pavlov 34 days ago

Technically DS is an independent sibling of Citroën within Stellantis, a sprawling car conglomerate that owns a dog’s dinner of car brands in Europe and the USA.

Hamuko 34 days ago

If we want to get really technical, “DS4” is a model from Citroën and they later spun out the DS lineup into its separate brand, with the “Citroën DS4” becoming “DS 4”, “DS” being the make and “4” being the model.

pavlov 34 days ago

And even more pedantically, DS has recently adopted a new naming scheme where the former DS 4 is now written as DS N°4, pronounced "number 4"...

Their stated inspiration for this SEO bomb is Chanel perfumes.

orthoxerox 34 days ago

It's still the Lexus to Citroen's Toyota.

drcongo 34 days ago

Pavlov's dog's dinner?

insensible 35 days ago

Trekkies are experiencing a major regression from Deep Space Nine.

kjs3 34 days ago

There were prototypes. The Cardassians never get it right the first (eight) times.

burnte 34 days ago

Deep Space 4 vanished and was never seen again.

RALaBarge 34 days ago

They never should have trusted Quark.

jofzar 35 days ago

I am actually kind of disappointed it wasn't a deep dive on the DualShock 4.

smcleod 34 days ago

That's the Flash version, not the full model, and only at roughly Q2-Q3, so while impressive, it's still quite different from the full model.

rurban 34 days ago

Not really. I'm now building another fast C compiler with DeepSeek 4 Flash, and I rarely have to step outside to use Pro or Sonnet, GPT or kimi-2.6. Flash is very capable of almost everything.

gekoxyz 34 days ago

which harness are you using? pi? opencode?

rurban 34 days ago

That's not a harness. That's an agent cli. A harness is something completely different. Wish people could use proper terminology.

A test harness is a collection of software and test data configured to test a program unit by running it under varying conditions and monitoring its behavior and outputs. It automates the execution of test suites, providing the necessary stubs, drivers, and runtime environments so developers can isolate and verify specific code components.

I use opencode (lockedcode is still vaporware), claude, kimi and codex.

And most models. Just no Google models so far, I don't trust them.

modmans2nd 20 days ago

It is really telling when people say that. It’s clear they think the job of harness development is done by selecting the agent environment.

computably 34 days ago

Akshually, they said "harness," and not "test harness."

There's no particular reason "agent harness" can't have practically the same definition, substituting test-specific concepts for agent-specific ones.

Sinidir 34 days ago

Harness: a piece of equipment with straps and belts, used to control or hold in place a person, animal, or object.

So yes, the general meaning applies to test setup and running, and also to the agent CLI, which is the harness for the model.

rurban 34 days ago

No, an agent cli is no harness. You have to provide a harness for an agent by yourself, otherwise it will run free. Which is called vibe coding. Free as you wish, without any harness.

dolmen 34 days ago

May I ask about your trust issue regarding Google models?

Is it about quality issues (lack of guardrails, agent runs dangerous commands)? I have seen first-hand Gemini-cli going out of the project directory and using my home directory as a work area.

Or is it about terms of service?

Or other concerns?

rurban 33 days ago

Quality. They are too dumb.

And the lack of ease of use.

tredre3 34 days ago

You're free to fight the terminology if you want (I did at first too), but the zeitgeist has chosen a meaning that disagrees with you, so people will see you as being deliberately obtuse and unpleasant when you fight back.

Learning when to let go is an incredibly important skill that I have learned way too late in life.

DeathArrow 35 days ago

>The blog post implies that it currently requires 96GB of VRAM.

From the GitHub page, it seems it only supports Apple and DGX Spark. I have 128 GB of RAM and a 3090, but it probably won't work.

thomasm6m6 34 days ago

FYI, llama.cpp (which antirez/ds4 is inspired by) supports system ram. E.g. [1] is a good guide for running a similar-sized model with 128gb ram and a 3090-sized GPU.

[1] https://unsloth.ai/docs/models/tutorials/minimax-m27

(Unsloth's deepseek-v4 support is still WIP)

DeathArrow 34 days ago

Thanks, I can run Qwen 3.6 27B with vLLM, but I was curious about antirez's tool.

embedding-shape 34 days ago

Have you seen it get stuck in endless loops in maybe ~10-20% of invocations? It seems to happen with both the responses and chatcompletion APIs, and no matter what inference parameters I try, it happens for at least 1 in 10 requests. I've tried every compatible vLLM version and am currently running it from git (#main), yet the issue persists.

It seems to happen with various quantizations too, including the NVFP4 versions, so it looks like a deeper issue to me, or perhaps a hardware incompatibility.

modmans2nd 20 days ago

There’s a fixed version out there with corrected templates.

manmal 34 days ago

It wouldn’t be useful with your setup, probably 3-4 tokens per second.

DeathArrow 34 days ago

Yep, maybe I can open a feature request if it makes sense technically.

zozbot234 34 days ago

Arguably it makes more sense technically to get the model support into llama.cpp, which provides many options for GPU+CPU split inference already.

hellifino 33 days ago

I have an AMD 3995WX and 128GB of DDR4-3200. I can load the Q2 and, using -t 64, get around 4 t/s out of the box. Haven't tried any other configs yet.

I do not think it can use multi-gpu or gpu/cpu offloading at this time.

zozbot234 33 days ago

That sounds memory bandwidth limited. Does the total t/s decode throughput improve by running multiple sessions in parallel?

(Note, that's total not per-session. Tok/s figures per session will initially tank since you're using the same total mem bandwidth to load incrementally more active params.)
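A rough sketch of the total vs. per-session distinction, assuming decode is purely memory-bandwidth bound, that shared weights are read once per step, and that no two sessions route to the same experts (worst case). The function name and all numbers (100 GB/s, the 1.0 GB / 2.25 GB split) are hypothetical placeholders, not measurements of DS4 or of the machine above:

```python
def decode_throughput(bandwidth_gbs, shared_gb, expert_gb, sessions):
    """Return (total_tok_s, per_session_tok_s) under a bandwidth-only model."""
    # One decode step reads the shared weights once, plus the routed
    # experts of every session (assumed: no expert overlap between sessions).
    gb_per_step = shared_gb + sessions * expert_gb
    steps_per_s = bandwidth_gbs / gb_per_step
    # Each step produces one token for every session.
    return sessions * steps_per_s, steps_per_s


# Hypothetical inputs: 100 GB/s usable bandwidth, 1.0 GB of shared weights,
# 2.25 GB of routed-expert weights per session per token.
for n in (1, 2, 4, 8):
    total, per_session = decode_throughput(100.0, 1.0, 2.25, n)
    print(f"{n} sessions: total {total:.1f} tok/s, per session {per_session:.1f} tok/s")
```

In this simplified model the total rises only because the shared weights are amortized across sessions, while the per-session figure drops with every session added. It ignores compute limits and KV-cache traffic, so it is at best a starting point for the measurement being asked about.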

zozbot234 34 days ago

> The blog post implies that it currently requires 96GB of VRAM.

Has anyone tested what happens if you try and run this on lower-RAM Macs? It might work and just be a bit slower as it falls back on fetching model layers from storage.

conradkay 34 days ago

It'd be way slower since you'd be doing that work every token

zozbot234 34 days ago

True (with 64GB RAM it'd have to fetch 20% of its active experts from disk already, about 650MB/tok at 2-bit quant - and that percentage rises quickly as you lower RAM further); my question is just a more practical one about whether it runs at all, how bad the slowdown is, and to what extent you might be able to get some of that decode throughput back by running multiple (slower) agent sessions in parallel under a single Dwarf Star 4 server.
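The 650MB/tok figure can be reproduced from the numbers in this thread (13B active params, 2-bit quant, 20% of active experts missing from RAM). This is only a sketch of the arithmetic; the function names are made up for illustration, and any storage bandwidth passed in is an assumption you would have to measure yourself:

```python
def disk_mb_per_token(active_params, bits_per_weight, miss_fraction):
    """MB that must come from storage for each generated token."""
    active_bytes = active_params * bits_per_weight / 8
    return active_bytes * miss_fraction / 1e6


def storage_bound_tok_s(mb_per_token, storage_mb_s):
    """Upper bound on tok/s if storage reads were the only cost."""
    if mb_per_token == 0:
        return float("inf")  # everything is served from RAM
    return storage_mb_s / mb_per_token


mb = disk_mb_per_token(13e9, 2, 0.20)  # about 650 MB per token
print(f"{mb:.0f} MB/token from disk")
```

Calling `storage_bound_tok_s(mb, your_measured_mb_per_s)` then gives a ceiling, not a prediction: it assumes perfectly sequential reads and no overlap between I/O and compute.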

computably 34 days ago

Storage is multiple orders of magnitude slower than RAM. Pretty sure it'd be more like 10s/tok than anything reasonable.

zozbot234 34 days ago

Active params for this model are 13B, which takes about 6.5GB at full native quantization, or perhaps 3.25GB at the 2-bit quant that's being provided here. That should take significantly less than 10s to fetch on Mac storage, especially given that some fraction of the model weights would be cached in RAM. Sounds like something worth testing out if it can be made to work out of the box with DS4.
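A quick check of the sizes quoted here, as a sketch. The 4 bits per weight for the "native" case is inferred from the 6.5GB figure (13B params at 6.5GB works out to 4 bits each), not taken from any model card, and the helper names are illustrative:

```python
ACTIVE_PARAMS = 13e9  # active parameters per token, as stated in the comment


def weights_gb(params, bits_per_weight):
    """Size in GB (10^9 bytes) of `params` weights at a given bit width."""
    return params * bits_per_weight / 8 / 1e9


def implied_bandwidth_gbs(gb_per_token, seconds_per_token):
    """Storage bandwidth at which reading gb_per_token takes seconds_per_token."""
    return gb_per_token / seconds_per_token


native_gb = weights_gb(ACTIVE_PARAMS, 4)  # assumed 4-bit native: 6.5 GB
q2_gb = weights_gb(ACTIVE_PARAMS, 2)      # 2-bit quant: 3.25 GB

# Worst case: every active weight is read from storage for every token.
slow = implied_bandwidth_gbs(q2_gb, 10.0)
print(f"native {native_gb} GB, Q2 {q2_gb} GB, 10 s/tok implies {slow:.3f} GB/s")
```

Put differently, 10s/tok at the 2-bit quant would correspond to reading at only about 0.33 GB/s even with nothing cached in RAM, which is the basis for expecting the real number to be lower than 10s.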

Wowfunhappy 34 days ago

Thanks. How is DwarfStar4 different from llama.cpp?

covoeus 34 days ago

llama.cpp is general purpose in the sense that it supports many different model architectures. ds4 is laser focused on deepseek v4 flash, thus having a leaner codebase

rpigab 34 days ago

I knew Death Stranding 3 wasn't out yet!
