Hacker News
by anonym29 26 days ago
I know it's only tangentially relevant, but I've been baffled by the interest in DeepSeek V4 Flash. It's larger and less efficient than Minimax M2.7, and in many cases it performs worse on both objective benchmarks and a real-world sniff test (admittedly, n=1). DS4F hallucinates at extraordinary rates while M2.7 does not. The 196k context length that M2.7 was natively trained up to represents neither a hard technical ceiling (this is metadata that can easily be adjusted), nor a meaningful degradation threshold - I've personally run it up past 330k token context windows where it maintained full coherency, and still completed my one-shot agentic task to my satisfaction.
3 comments

M2.7 is no longer open source, it's been changed to an NC license. It's an OK model, but IME out of the big 5 Chinese models (ds, glm, kimi, minimax and qwen), DS models have generally shown better generalisation and real-world usage than all the others, even if the benchmark scores were lower. Less benchmaxxxing, basically.

DS4 also has some neat new arch improvements, giving it a lot of context at lower VRAM usage. So it will be cheaper to serve, B for B, than previous models.

M2.7 was never open source, only open weight, which fulfills a lot of the spirit of open source, but isn't really the same thing as a whole. The noncommercial license is basically impossible to enforce if you're self-hosting anyway, because it's essentially impossible to prove that any individual commit was made by Minimax M2.7 in an environment where multiple self-hosted models are being run side-by-side. Besides that, you're not obligated to abide by terms you never agreed to in the first place, and you don't need to agree to anyone's terms to download open weights from a peer or over a torrent. These weights amount to public information that freely exists and is shared in the commons; not a scarce, rivalrous good; not copyrighted works; not sensitive intellectual property.

The weights may nominally be legally copyrighted, but the rightsholder certainly doesn't seem to be making anything resembling a serious effort to actually assert or defend those rights; on the contrary, they are doing the exact opposite by maximizing the gratis distribution, including knowingly and willingly via third parties, with no copy protection whatsoever, and no reasonable expectation of non-distribution.

They are not behaving like an entity trying to protect valuable intellectual property, they are behaving like an entity trying to reap the reputational and network effect benefits of maximizing the free distribution of a public good.

Less memory usage by the KV cache doesn't mean cheaper to serve overall. You need more hardware to serve DS4F than Minimax M2.7, since DS4F is ~54B total params larger to begin with, and KV cache memory efficiency does nothing to address that. Once you've acquired the hardware, the capex cost is basically fixed and opex just comes down to power draw, which will be marginally higher per token with DS4F than with M2.7 owing to the slower speeds that result from 13B active params vs 10B active params on forward passes during TG.

KV cache size is the main constraint on batching (for any given ctx length), that's a huge deal for efficiency both locally and in the data center. DeepSeek V4's reduced KV requirement is a real game changer, it definitively unlocks batching requests together for local inference, not just at scale.
This may be relevant for parallelizable workloads. For reference on my perspective: I come at this as someone who is exclusively concerned with sequential, non-parallelizable, single-user, single-system workloads.
If you have multiple chats going at the same time in your LLM web interface, that's already a parallelizable workload wrt. batched inference. And this broadly describes the more sophisticated users of LLMs (who are using it for more than just casual chit-chat), especially wrt. the largest "pro" models. Parallelism is also quite applicable to agentic workloads.
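To make the batching argument concrete, here is a rough sketch of how per-sequence KV cache size caps the number of concurrent sequences. All the numbers below are made up for illustration; they are not measurements of DS4F, M2.7, or any specific machine.

```python
def max_concurrent_sequences(total_mem_gb, weights_gb, kv_gb_per_sequence):
    """Count how many full-length sequences fit in the memory left after weights."""
    if kv_gb_per_sequence <= 0:
        raise ValueError("kv_gb_per_sequence must be positive")
    free_gb = total_mem_gb - weights_gb
    if free_gb <= 0:
        return 0  # the weights alone do not fit
    return int(free_gb // kv_gb_per_sequence)


# Hypothetical figures: same memory and weight size, different KV footprints.
kv_heavy = max_concurrent_sequences(128, 90, 12.0)
kv_light = max_concurrent_sequences(128, 90, 2.0)
print(f"KV-heavy model: {kv_heavy} sequences, KV-light model: {kv_light} sequences")
```

This ignores activations, runtime overhead, and paged or shared-prefix caches, so treat it as an upper bound on batch size rather than a prediction.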
As to the 2nd part of your message, it's really easy to verify yourself (on openrouter).

DSv4-flash is currently being served at 0.14/0.24 $/MTok by most of the providers (8 as of writing this) and even a bit cheaper by 2 providers.

Minimax2.7 is being served at 0.30/1.20 $/MTok by most providers (4 providers as of writing this) and double that price by 2 providers.
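For a single comparable number, the input/output prices quoted above can be blended. This is a small sketch that assumes a 3:1 input-to-output token mix; that ratio is my assumption and your workload may look quite different.

```python
def blended_price(input_price, output_price, input_share=0.75):
    """Blend $/MTok input and output prices by the share of input tokens."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_price + (1.0 - input_share) * output_price


# Prices as quoted above ($/MTok, input/output); the 3:1 mix is an assumption.
ds4_flash = blended_price(0.14, 0.24)
minimax_m27 = blended_price(0.30, 1.20)
print(f"DSv4-flash: ${ds4_flash:.3f}/MTok blended")
print(f"Minimax2.7: ${minimax_m27:.3f}/MTok blended")
print(f"ratio: {minimax_m27 / ds4_flash:.2f}x")
```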

As for the first part of your message, this is actually a good illustration of the misunderstanding around licensing LLMs. There are open-source models out there (Apache 2.0 and MIT), there are source-available (i.e. open weights) models such as the llamas and minimax2.7, and there is something in between with the latest kimi (MIT w/ attribution). Open source in the context of LLMs means that you get a license to run, inspect, modify and re-release a model. It was never about data or training. That's a very common interpretation, but it's wrong IMO. I get that it's contested, so anyway. Sorry for the tangent.

Third party inference costs are a moot point for people running these models locally.

I am currently serving Minimax M2.7 to myself at ~$0.015/1M blended tokens worth of electricity on my own local hardware, where I get all of the confidentiality, integrity, and availability benefits that are lost when choosing to run open weight models on someone else's API.
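For anyone who wants to estimate their own number, the electricity cost per million tokens is just power draw times time times price. A minimal sketch follows; the wattage, electricity price, and blended throughput are placeholder values, not my actual measurements.

```python
def electricity_cost_per_mtok(power_watts, price_per_kwh, tokens_per_second):
    """Electricity cost in dollars to process one million tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    hours = 1_000_000 / tokens_per_second / 3600  # wall-clock hours per MTok
    kwh = power_watts / 1000 * hours              # energy used in that time
    return kwh * price_per_kwh


# Placeholder inputs: 100 W under load, $0.12/kWh, 200 tok/s blended
# (prompt processing and generation averaged together).
cost = electricity_cost_per_mtok(100, 0.12, 200)
print(f"${cost:.4f} per 1M blended tokens")
```

Blended throughput matters a lot here, since prompt processing is usually far faster than token generation.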

Open source means that all of the information necessary to recreate the final product is public, which in the context of LLMs, would include all of the training material, and build instructions (scripts to do the training). Very few models actually achieve this - the Nemotron family is the only one that comes to mind. A license to run, inspect, modify, and re-release is a good improvement on open weight models, but does not alone amount to the model actually being open source.

You are welcome to an alternative understanding of the definition of open source - as you correctly note, it's a contested term - just know that your definition is not the more widely accepted one that people think of when they hear "open source".

Your version of the term is much more aligned with the OSI, which was a federation of anti-FLOSS industry bodies created with the intent to capture, redefine, and weaken the original spirit of the FLOSS movement, which predates the OSI by almost a decade - the GPL was first released in '89, compared to the OSI's formation in '98 by members of the $10B for-profit Netscape Corporation, whose flagship product was originally proprietary and was only open sourced after commercial failure against proprietary competitors.

None of this should be construed as an implication that I'm anti-open-weight. As I mentioned earlier, I think open weight models fulfill a lot of the spirit of open source. While a world where truly open source models are the norm is obviously preferable to a world where only open weight models are the norm, a world where only open weight models are the norm is still vastly preferable to a world where proprietary models running on other people's hardware is the norm.

I just think that we should be careful to avoid watering down terminology in ways that serve proprietary commercial interests over the interests of the public and of users. Open-washing is real, and it harms the interests of users.

> Open source in the context of LLMs means that you get a license to run, inspect, modify and re-release a model. It was never about data or training.

eeeh? what?

the whole reason "open-weights" phrase got coined was because corps started sharing weights, but no way to replicate the training that created it

it was viewed the same as sharing compiled binary, but no source code - against the whole point of open-source

> The 196k context length that M2.7 was natively trained up represents neither a hard technical ceiling (this is metadata that can easily adjusted), nor a meaningful degradation threshold

FWIW, I find that in OpenCode it starts becoming erratic after around 80k tokens (sometimes less).

May I ask what you used for the DS4F inference? It is a model with a very low hallucination rate in my tests.
Per AA's Omniscience Index benchmark, the "non-hallucination rate" subcomponent (1 - hallucination rate) is 4% for DS4F vs 66% for M2.7.

https://artificialanalysis.ai/leaderboards/models?weights=op...

On the same page DS4F scores much better on Omniscience Accuracy. I would take those numbers with a grain of salt. For instance, I ran different benchmarks against Qwen 3.6 27B and DS4F quantized at 2 bits, and the DS4F hallucination rate is much lower. In general I find artificialanalysis benchmarks not very aligned with what I see in the field, and in this specific case, where I did many tests, even less so.
Btw, a few data points:

1. DS4F can run on a 128GB MacBook. M2.7 is larger (8 bit weights of routed experts). It remains to be seen how it holds up at 4 bits. At 2 bits it may not work well at all.

2. Just the KV cache of M2.7 would take ~50GB for 200k tokens AFAIK. It does not have the compressed KV cache that DS4F features.

3. The models are very similar in performance, despite all that. And DS4F is likely getting an update soon.

So it is basically a quasi-frontier model that can run on a 96/128GB MacBook at large context windows. That's non trivial. Likely a coding version could be released in the future.
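For reference, the KV cache estimate in point 2 can be approximated with the standard formula for grouped-query attention. This is a back-of-the-envelope sketch: the layer count, KV head count, and head dimension below are illustrative assumptions, not values confirmed from the M2.7 config, and it assumes every layer uses full (uncompressed) attention.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Size of an uncompressed KV cache: one K and one V vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


# Assumed config for illustration: 62 layers, 8 KV heads, head_dim 128.
fp16 = kv_cache_bytes(62, 8, 128, 200_000)                     # 16-bit cache
int8 = kv_cache_bytes(62, 8, 128, 200_000, bytes_per_elem=1)   # roughly 8-bit cache
print(f"16-bit: {fp16 / 1e9:.1f} GB, 8-bit: {int8 / 1e9:.1f} GB")
```

Real 8-bit cache formats carry small per-block scale factors, so the 8-bit figure slightly understates actual usage.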

>1. DS4F can run on a 128GB MacBook. M2.7 is larger (8 bit weights of routed experts). There is to see how it holds at 4 bits. At 2 bits it may not work well at all.

M2.7 is smaller than DS4, 230B total params vs 284B total params. At any given quantization level, M2.7 will require ~19% less memory for the weights than DS4F at the same quantization level. Both can be quantized to arbitrary precision levels. Larger models like these quantize much better at lower precision than smaller models do. There is still loss, but it's less catastrophic in terms of usability degradation than for say, 27B or 14B or 8B models. Again, n=1, but M2.7 holds up phenomenally well for me with unsloth's IQ2_XXS UD.
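The arithmetic behind the ~19% figure, as a quick sketch. The parameter counts are the ones stated above; the bits-per-weight values are round illustrative averages, since real quant files mix precisions across tensors.

```python
def weights_gb(total_params_billions, bits_per_weight):
    """Approximate weight memory in GB (decimal) at an average bits-per-weight."""
    return total_params_billions * bits_per_weight / 8


M27_PARAMS_B = 230   # total params, as stated above
DS4F_PARAMS_B = 284  # total params, as stated above

for bpw in (8, 4, 2):  # illustrative average precisions
    m27 = weights_gb(M27_PARAMS_B, bpw)
    ds4f = weights_gb(DS4F_PARAMS_B, bpw)
    print(f"{bpw} bpw: M2.7 {m27:.1f} GB vs DS4F {ds4f:.1f} GB")

# The relative saving does not depend on the quantization level.
saving = 1 - M27_PARAMS_B / DS4F_PARAMS_B
print(f"M2.7 needs {saving:.1%} less memory for weights")
```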

>2. Just the KV cache of M2.7 would take ~50GB for 200k tokens AFAIK. It does not have the compressed KV cache that DS4F features.

The KV cache can also be quantized. At Q8_0, this is essentially lossless. I can fit a 400k context window with Q8_0 KV cache quantization along with unsloth's IQ2_XXS UD weight quantization (plus my running OS) on a machine with just 128 GB of unified memory. Strix Halo, not Apple Silicon. There are more exotic approaches to KV cache quantization with much higher efficiency, like TurboQuant, but this is beside the point.

>3. The models are very similar in performances, despite all that. And DS4F is likely getting an update soon.

Yes, though it's worth noting that DS4F does require about 23% more total memory for weights at any given quantization level (284B vs 230B), will need to shuffle about 30% more data through the pipeline on every forward pass (A13B vs A10B), has much higher hallucination rates per AA, and hasn't been fully post-trained. DS4 isn't a base model, it has been instruct trained, tool trained, etc, but there is a lot of capability that has been left on the table as of the current checkpoints, which are what's actually available now.

>So it is basically a quasi-frontier model that can run on a 96/128GB MacBook at large context windows. That's non trivial. Likely a coding version could be released in the future.

MiniMax M2.7 fits into this same box - quasi-frontier model that can run on 96/128GB unified memory platforms with a large context window. You're right that it's non-trivial. My preference comes in part from the fact that M2.7 already is coding focused, and had been out for almost 2 months before DS4F showed up.

By the way, in spite of my preference for M2.7 over DS4F (and for Vulkan over ROCm on my hardware), I'm a big fan of your work on DarkStar 4. I admire what you've achieved with the project, how much work you've put into it, and your willingness to share that with the world, too. Thank you for your contributions to the open LLM ecosystem.

Didn't know M2.7 could also withstand extreme quantization. I had the feeling that, since it shipped at Q8, it would be easily damaged that way. Very interesting data point! And thank you for the nice words. Btw it really looks like very sparse models of ~250-300B parameters are a real option for local inference.