by GreenGames 62 days ago
This reads like you didn’t read the post.

z-lab runs BF16 on B200 (54+ GB). There is no z-lab path that fits on a 24 GB 3090. That is literally the entire point of our work, and it is stated in the second paragraph. If you had checked the HF model card you linked before posting, you would see the same thing. Before this repo, there was no path to run this... SGLang's GGUF path for this model is broken. llama.cpp doesn't have DFlash speculative decoding at all. If you wanted to run this hybrid model fast on a 24 GB consumer card, there was nothing...

That took weeks of real engineering.

Calling that "vibecoded" because we used a bit of AI to clean up the README is the laziest possible critique. An LLM reading the DFlash paper does not catch verify_logits_buf being sized vocab * q_len when DDTree reads vocab * (budget + 1). That is hours of debugging with nvidia-smi and memory sanitizers, not prompting.
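For readers who want the sizing bug spelled out, here is a minimal sketch of the mismatch described above: a logits buffer allocated for q_len rows but read as budget + 1 rows. The helper names and the example values for vocab and q_len are hypothetical and not taken from the repo; only budget=22 appears in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufCheck:
    allocated: int  # elements the buffer was sized for
    required: int   # elements the tree verifier reads

    @property
    def shortfall(self) -> int:
        # Number of elements read past the end of the allocation (0 if it fits).
        return max(0, self.required - self.allocated)


def check_verify_logits_buf(vocab: int, q_len: int, budget: int) -> BufCheck:
    # Hypothetical helper: one row of `vocab` logits per position.
    # Allocation assumes q_len rows; the tree verify pass reads budget + 1 rows.
    return BufCheck(allocated=vocab * q_len, required=vocab * (budget + 1))


# Assumed values for illustration only (vocab and q_len are made up).
check = check_verify_logits_buf(vocab=1000, q_len=16, budget=22)
print(check.allocated, check.required, check.shortfall)
```

Whenever budget + 1 exceeds q_len the read runs past the allocation, which is the kind of out-of-bounds access a memory sanitizer flags and a paper summary does not.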

The 207 and 129.5 numbers are both in the second sentence of the post and again in the TL;DR. 207.6 is peak tok/s in the linked demo video, 129.5 is the HumanEval 10-prompt mean at DDTree budget=22. We state both right below the title.

On the Q4 KV cache: the tradeoff is disclosed with actual numbers. AL 8.56 -> 8.33 at short context (3% drop), dramatically better at long context. It’s the only way 128K allocates on 24 GB. The binary is env-selectable, you can run BF16 KV if you don’t need 128K. Both are benchmarked.
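The arithmetic behind those figures can be checked with a short sketch. The acceptance-length numbers are the ones quoted above. The Q4 storage cost is an assumption: it uses the llama.cpp-style Q4_0 layout (32 values per block, 16 bytes of 4-bit quants plus a 2-byte scale), which may not match the format this binary actually uses.

```python
def relative_drop(before: float, after: float) -> float:
    # Fractional decrease going from `before` to `after`.
    return (before - after) / before


# Bits stored per KV element.
BF16_BITS = 16.0
# Assumption: Q4_0-style block of 32 values in 18 bytes (16 B quants + 2 B scale).
Q4_0_BITS = 18 * 8 / 32


def kv_shrink_factor(full_bits: float, quant_bits: float) -> float:
    # How many times smaller the KV cache gets per element.
    return full_bits / quant_bits


print(f"AL drop: {relative_drop(8.56, 8.33):.1%}")
print(f"KV shrink vs BF16: {kv_shrink_factor(BF16_BITS, Q4_0_BITS):.2f}x")
```

Under that assumption the cache is roughly 3.6x smaller per element than BF16, and the quoted acceptance-length change works out to a bit under 3%.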

3 comments

> This reads like you didn’t read the post.

I was discussing details I read in your repo. How did you conclude that I didn't read the post? I'm skeptical a human is writing these comments because everything you're posting reads like LLM output.

> On the Q4 KV cache: the tradeoff is disclosed with actual numbers. AL 8.56 -> 8.33 at short context (3% drop), dramatically better at long context.

I'm sorry, but you're not the first person (or LLM) to think of using Q4 KV cache to fit more context in VRAM.

The degradation is far more than 3% on real evals. Q8 only recently became usable on Qwen3.5 in llama.cpp with the context rotation changes. Before that, BF16 was necessary to get decent performance in real tasks.

Q4 is a non-starter for real work. The fact that you're still trying to defend it tells me you haven't used this for anything other than token/sec racing.

This is an embarrassing reply. Unfortunately you’ve hit the hour mark so you cannot delete it. :(
You wrote this reply with Claude, and it's lying about it only being README.md. OP, and I, know this because you and Claude documented it.*

I use the same tools, I'm not mad at you for using it. It's just, idk man, you want to use it tactically in ways that are a net benefit to you. Not in ways that embarrass you or lie.

* https://github.com/Luce-Org/lucebox-hub/commit/cfc38f67275ee...

* * Here's Claude's version of this very post if you want to see an example of Claude voice vs. original and how to spot it: https://gist.githubusercontent.com/jpohhhh/a42060f0f34339c4b...