HN Mirror

by furyofantares 234 days ago

> Training Performance is Real (When It Works)

It looks like it worked? Why's it say this?

> Verdict: Inference speed scales proportionally with model size.

The author only tried one model size, and it came out faster than NVIDIA's reported speed for a larger model. One data point can't show how speed scales with size, so that's not really a "Verdict".

> Verdict: 4-bit quantization is production-viable.

That's not really something you can conclude from messing around with it and saying you like the outputs.

> GPU Inference is Fundamentally Broken

Probably not? It probably just doesn't work in llama.cpp right now. It takes a while reading this to work out that they tried ollama and then later llama.cpp, which I'd guess is basically testing llama.cpp twice. Actually, I'm not even convinced of that much: I'm sure the author ran into errors that might be a pain to figure out, but there's no evidence it's worse than that.

But then it says this is the "root cause":

    ARM64 + Blackwell + CUDA 13.0 = Bleeding Edge
    ↓
    Limited production testing
    ↓
    Edge cases in numerical precision (inference)
    ↓
    Memory management issues (training)

Am I to believe GPU inference is really fundamentally broken? I'm not seeing the case made here, just claims. At this point the LLM seems to have gotten confused about whether it's talking about the memory fragmentation issue or the GPU inference issue. But it's hard to believe anything from this point on in the post.