Technically DS is an independent sibling of Citroën within Stellantis, a sprawling car conglomerate that owns a dog’s dinner of car brands in Europe and the USA.
If we want to get really technical, “DS4” was a Citroën model; they later spun the DS lineup out into its own brand, with the “Citroën DS4” becoming the “DS 4”, “DS” being the make and “4” being the model.
Not really. I'm now building another fast C compiler with DeepSeek 4 Flash, and rarely have to step outside it to use Pro or Sonnet, gpt or kimi-2.6. Flash is very capable of almost everything.
That's not a harness. That's an agent cli. A harness is something completely different. Wish people could use proper terminology.
A test harness is a collection of software and test data configured to test a program unit by running it under varying conditions and monitoring its behavior and outputs. It automates the execution of test suites, providing the necessary stubs, drivers, and runtime environments so developers can isolate and verify specific code components.
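To make that definition concrete, here is a minimal sketch of a test harness in Python. Everything in it is hypothetical and invented for illustration: `price_with_tax` is the unit under test, `stub_tax_lookup` is a stub replacing an external dependency, and `run_harness` is the driver that runs the unit under varying conditions and records its outputs.

```python
from typing import Callable

def price_with_tax(amount: float, tax_lookup: Callable[[str], float], region: str) -> float:
    """Hypothetical unit under test: depends on an external tax service."""
    return round(amount * (1 + tax_lookup(region)), 2)

def stub_tax_lookup(region: str) -> float:
    """Stub standing in for the real tax service, so the unit is isolated."""
    return {"FR": 0.20, "US": 0.0}[region]

# Test data: (inputs, expected output) pairs covering different conditions.
CASES = [
    ((100.0, "FR"), 120.0),
    ((100.0, "US"), 100.0),
    ((19.99, "FR"), 23.99),
]

def run_harness() -> list:
    """Driver: run every case against the unit and record the observed behavior."""
    results = []
    for (amount, region), expected in CASES:
        actual = price_with_tax(amount, stub_tax_lookup, region)
        passed = abs(actual - expected) < 1e-9
        results.append(((amount, region), expected, actual, passed))
    return results

if __name__ == "__main__":
    for inputs, expected, actual, passed in run_harness():
        print("PASS" if passed else "FAIL", inputs, "expected", expected, "got", actual)
```

The stub, the test data, and the driver together are the harness in this older sense; none of them is part of the program being tested.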
I use opencode (lockedcode is still vaporware), claude, kimi and codex.
And most models. Just no Google models so far, I don't trust them.
No, an agent cli is not a harness. You have to provide a harness for an agent yourself, otherwise it will run free. That is what's called vibe coding: free as you wish, without any harness.
May I ask about your trust issue regarding Google models?
Is it about quality issues (lack of guardrails, agent runs dangerous commands)? I have seen first-hand Gemini-cli going out of the project directory and using my home directory as a work area.
You're free to fight the terminology if you want (I did at first too), but the zeitgeist has chosen a meaning that disagrees with you, so people will see you as being deliberately obtuse and unpleasant when you fight back.
Learning when to let go is an incredibly important skill that I have learned way too late in life.
FYI, llama.cpp (which antirez/ds4 is inspired by) supports system RAM. E.g. [1] is a good guide for running a similar-sized model with 128GB of RAM and a 3090-sized GPU.
Have you had it get stuck in endless loops in roughly 10-20% of invocations? It seems to happen with both the responses and chatcompletion APIs, and no matter which inference parameters I try, it happens for at least 1 in 10 requests. I've tried every compatible vLLM version and am currently running it from git (#main), yet the issue persists.
It seems to happen with various quantizations too, including the NVFP4 versions, so it looks like a deeper issue to me, or perhaps a hardware incompatibility.
That sounds memory-bandwidth limited. Does the total t/s decode throughput improve when you run multiple sessions in parallel?
(Note, that's total, not per-session. Tok/s figures per session will initially tank, since you're using the same total memory bandwidth to load incrementally more active params.)
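A back-of-envelope model of that effect, as a sketch only. It assumes decode is purely bandwidth-bound (compute and KV-cache reads are ignored), that each session routes to experts independently and uniformly, and that a weight needed by several sessions in the same step is read once. The function name and every number in the example call are hypothetical placeholders, not measurements of any real machine or model.

```python
def decode_throughput(n_sessions: int, bandwidth_gb_s: float, shared_gb: float,
                      expert_pool_gb: float, active_fraction: float):
    """Return (total tok/s, per-session tok/s) under a bandwidth-only model.

    shared_gb:       weights every token reads (attention, shared layers)
    expert_pool_gb:  size of the whole routed-expert pool
    active_fraction: share of that pool a single token activates
    """
    # Expected fraction of the expert pool touched by at least one session per step.
    touched = 1.0 - (1.0 - active_fraction) ** n_sessions
    gb_per_step = shared_gb + expert_pool_gb * touched
    steps_per_s = bandwidth_gb_s / gb_per_step
    # Each step emits one token per session.
    return steps_per_s * n_sessions, steps_per_s

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        total, per_session = decode_throughput(
            n, bandwidth_gb_s=400.0, shared_gb=1.0,
            expert_pool_gb=80.0, active_fraction=0.03)  # placeholder values
        print(f"{n} sessions: total {total:.1f} tok/s, per session {per_session:.1f} tok/s")
```

Under these assumptions the per-session figure falls as sessions are added, because each step has to load more distinct experts, while the total still rises because the shared weights and any overlapping experts are only read once per step.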
> The blog post implies that it currently requires 96GB of VRAM.
Has anyone tested what happens if you try and run this on lower-RAM Macs? It might work and just be a bit slower as it falls back on fetching model layers from storage.
True (with 64GB RAM it'd have to fetch 20% of its active experts from disk already, about 650MB/tok at 2-bit quant - and that percentage rises quickly as you lower RAM further); my question is just a more practical one about whether it runs at all, how bad the slowdown is, and to what extent you might be able to get some of that decode throughput back by running multiple (slower) agent sessions in parallel under a single Dwarf Star 4 server.
Active params for this model are 13B, which takes about 6.5GB at full native quantization, or perhaps 3.25GB at the 2-bit quant that's being provided here. That should take significantly less than 10s to fetch from Mac storage, especially given that some fraction of the model weights would be cached in RAM. Sounds like something worth testing out if it can be made to work out of the box with DS4.
llama.cpp is general purpose in the sense that it supports many different model architectures. ds4 is laser-focused on DeepSeek V4 Flash, which gives it a leaner codebase.
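The arithmetic behind those figures can be checked with a short sketch. Two things here are assumptions rather than facts from the thread: reading "full native quantization" as roughly 4 bits per weight (which is what 6.5GB for 13B params implies), and the 5 GB/s SSD read rate, which is only a placeholder. The 20% miss rate is the figure quoted elsewhere in this thread for a 64GB machine.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Size in GB of a set of weights at a given bit width (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

def fetch_seconds(gb: float, read_gb_s: float) -> float:
    """Time to stream `gb` from storage at a sustained read rate."""
    return gb / read_gb_s

active_native = weights_gb(13, 4)   # 6.5 GB, assuming ~4 bits per weight
active_2bit = weights_gb(13, 2)     # 3.25 GB at the 2-bit quant

# If 20% of the active experts miss RAM, that share comes from disk per token.
disk_gb_per_token = 0.20 * active_2bit

ASSUMED_SSD_GB_S = 5.0  # placeholder read rate, not a measured value

if __name__ == "__main__":
    print(f"active weights: {active_native} GB native, {active_2bit} GB at 2-bit")
    print(f"disk traffic at a 20% miss rate: {disk_gb_per_token * 1000:.0f} MB/token")
    print(f"worst case, all active weights from disk: "
          f"{fetch_seconds(active_2bit, ASSUMED_SSD_GB_S):.2f} s/token")
    print(f"at a 20% miss rate: "
          f"{fetch_seconds(disk_gb_per_token, ASSUMED_SSD_GB_S):.2f} s/token")
```

Even the worst case lands well under 10s per token at that assumed read rate, though real random-access expert reads would likely be slower than a sustained sequential figure.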