| HN Mirror


by rfoo 483 days ago

That's correct, because FA can't turn inference from memory-access-bound into compute-bound. But your claim that decoding is compute-bound is plainly wrong.

Compared to a naive implementation, FA made training / prefill (i.e. when multiple tokens of the same sequence are visible at once) compute-bound instead of memory-access-bound.

So, currently, on MHA/GQA, with Flash Attention, training/prefill is compute-bound, whereas decoding is memory-access-bound.

Before FA, both prefill and decode were bound by memory access. FA solved the problem for training/prefill. But because the kvcache is large, decoding is inherently bound by memory access.

Our goal is always to make everything compute-bound.
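To make the prefill vs. decode distinction above concrete, here is a rough sketch of the arithmetic intensity (FLOPs per byte of memory traffic) of one attention head under an idealized fused, FA-style kernel. The function name, the 2-FLOPs-per-multiply-add convention, and the simplifications (Q, K, V read once, O written once, causal mask and softmax cost ignored) are assumptions for illustration, not something stated in the thread.

```python
def attention_intensity(q_len, kv_len, head_dim, bytes_per_elem=2):
    """FLOPs per byte for one attention head with an idealized fused kernel.

    Assumptions (illustrative only): Q, K, V are each read once and O is
    written once; causal masking and softmax FLOPs are ignored.
    """
    # QK^T and PV are each q_len * kv_len * head_dim multiply-adds (2 FLOPs each).
    flops = 4 * q_len * kv_len * head_dim
    # Elements moved: Q and O (q_len rows each), K and V (kv_len rows each).
    elems = 2 * q_len * head_dim + 2 * kv_len * head_dim
    return flops / (elems * bytes_per_elem)


# Prefill: every token of a 4096-token sequence is a query.
prefill = attention_intensity(q_len=4096, kv_len=4096, head_dim=128)
# Decode: one new query token against 4096 cached keys/values.
decode = attention_intensity(q_len=1, kv_len=4096, head_dim=128)
print(f"prefill: {prefill:.0f} FLOP/byte, decode: {decode:.2f} FLOP/byte")
```

Under these assumptions the prefill figure grows with sequence length, while the decode figure stays near 1 FLOP/byte at 2 bytes per element no matter how long the context is. Passing `q_len` equal to a GQA group size is one way to approximate several query heads sharing a K/V head.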


menaerus 483 days ago

> But your claim that decoding is compute-bound is plainly wrong.

I did not say anything like that? What I said is that FlashAttention and arguably MLA will not yield any significant gains in inference time. And this is true.

Also, FWIW there are certainly model shapes that are compute-bound in the decode phase, so saying that decoding is universally, inherently bound by memory access is what is plainly wrong, to use your vocabulary.


rfoo 483 days ago

Apologies if I got it wrong, but:

> MLA, FlashAttention and similar optimizations will provide the benefits only when memory access time dominates

> Those would be [...] not the decode phase

This does sound like you are saying that memory access time does NOT dominate during the decode phase. But it does.

Reading your quotes, it looks like maybe you are talking about GPU utilization issues (i.e. not launching enough threads)? Due to the parallelization strategy of the original FA, it indeed does not even keep the GPU busy if q*bs is too small. But this is not an inherent limitation of FA-style kernels; it can be solved, and people did solve it. Or you simply batch more. Now you can keep the GPUs busy at 100% waiting for memory access, but memory access time still dominates, hence "memory-access-bound". And here comes MLA.

> FWIW there are certainly model shapes that are compute-bound in the decode phase

Yeah. But so far none of the ones I've read about really work ("work" meaning at most slightly worse than the alternatives) under the same wall-clock compute budget. Do you have a pointer to a working example, even for smaller 3B-ish models?


menaerus 483 days ago

> This does sound like you are saying that memory access time does NOT dominate during the decode phase. But it does.

Let's take llama3-8B as an example. The compute needed for self-attention per layer per token is roughly 0.15 GFLOPs. For simplicity, let's assume that we store all our weights in FP8 precision; then the memory we need to load for the same is 0.05 GB. Store traffic is negligible. If we expand this further to a 1k-token context, this becomes ~180 GFLOPs and ~0.35 GB per layer per 1k of context.

Assuming that our HW is H100, is this compute-bound or memory-bound?
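One way to approach the question above is a simple roofline check: compare the time the work would take at peak compute against the time the data movement would take at peak bandwidth. The sketch below takes the quoted figures at face value; the function name and the H100 peak numbers (approximate vendor figures for dense BF16 and HBM bandwidth on the SXM part) are assumptions that do not come from the thread.

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bytes_per_s):
    """Classify a workload as compute- or memory-bound with a simple roofline model."""
    intensity = flops / bytes_moved          # FLOP per byte the workload performs
    ridge = peak_flops / peak_bytes_per_s    # FLOP per byte where the two limits meet
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bytes_per_s
    label = "compute-bound" if compute_time > memory_time else "memory-bound"
    return label, intensity, ridge


# Assumed approximate H100 SXM peaks (dense BF16, HBM3); not from the thread.
H100_PEAK_FLOPS = 989e12
H100_PEAK_BW = 3.35e12

# The two data points quoted in the comment above, taken at face value.
per_token = roofline_bound(0.15e9, 0.05e9, H100_PEAK_FLOPS, H100_PEAK_BW)
per_1k_ctx = roofline_bound(180e9, 0.35e9, H100_PEAK_FLOPS, H100_PEAK_BW)
print(per_token)
print(per_1k_ctx)
```

With these assumed peaks the per-token figure lands on the memory-bound side and the 1k-context figure on the compute-bound side. The second result is sensitive to the peak you plug in (an FP8 tensor-core peak would move the ridge point), and neither input includes any K/V cache reads.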


rfoo 483 days ago

You need to load the cached k/v tensors, in addition to the weights. It's going to take me some minutes to find out what's wrong in this napkin math. Will edit or reply to this comment later.


menaerus 483 days ago

Re-computing everything every time is the worst-case scenario, which is why I included it in the example (1k tokens). In that case, the KV-cache size is obviously 0, but it is also obvious that this is a much worse alternative than using the KV-cache. Which is pretty much the reason why we have the KV-cache. Therefore the argument about loading the cached tensors doesn't make a difference at all.

> It's going to take me some minutes to find out what's wrong in this napkin math.

I am sure you will. Please don't be so entitled.


rfoo 483 days ago

> Therefore the argument about loading the cached tensors doesn't make a difference at all.

Sorry, what? Who the fuck in this world runs decode without a k/v cache??! If you run without a k/v cache you are basically doing prefill for every token you generate, and that's not what we call "decode". That's what we call "prefill".

The k/v cache, while named "cache", is a lot more important than what people would perceive as a "cache". It's an essential part of the algorithm. If you lose your k/v cache you must run prefill again. If you run prefill for every token you generate, it's not O(n^2), it's going to be O(n^3).

And yeah, you can run prefill 1000 times to generate a 1000-token output. Or you can run prefill once and, with the persisted k/v cache, run decode 1000 times. A tradeoff has to be made here, but it simply makes no sense to drop the k/v cache in the middle of generating a response: as your numbers show, recomputing is guaranteed to be slower than loading the k/v cache.
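The O(n^2) vs. O(n^3) point can be checked with a small counting sketch. It counts only query-key pairs scored by attention, ignores the causal mask and all non-attention work, and uses hypothetical function names, so treat it as an illustration of the scaling and not as a cost model.

```python
def attention_work_with_cache(prompt_len, new_tokens):
    """Query-key pairs scored when the k/v cache is kept across steps."""
    prefill = prompt_len * prompt_len  # one prefill over the prompt (causal mask ignored)
    # Each later step scores one new query against everything seen so far.
    decode = sum(prompt_len + i for i in range(1, new_tokens))
    return prefill + decode


def attention_work_without_cache(prompt_len, new_tokens):
    """Query-key pairs scored when every step re-runs prefill from scratch."""
    return sum((prompt_len + i) ** 2 for i in range(new_tokens))


cached = attention_work_with_cache(1000, 1000)
uncached = attention_work_without_cache(1000, 1000)
print(cached, uncached, uncached / cached)
```

For a 1000-token prompt and 1000 generated tokens, the uncached count comes out hundreds of times larger than the cached one under this simplified counting, and the gap widens as the output gets longer.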

> Please don't be so entitled.

When someone comes up with a wrong number, I try to be nice: run the numbers myself, figure out why they would end up with such a number, and point out the specific mistake, instead of dumping a page of my own calculations. It's usually just a missing factor somewhere. Guess I shouldn't be so nice to people who keep insisting that you can be fine without a k/v cache during decoding. Also, in this case I admit I failed to come up with a theory for why your number is so off, because giving out prefill numbers and claiming they're decode isn't in my book.

Yeah, I know this sounds extremely mean, feel free to downvote, but I hope readers can feel my frustration now.


rfoo 483 days ago

... and batching does not help: you batch more requests and get more kvcache to load, so it's still memory-access-bound.

MLA made it possible to cache a smaller form of k/v, mitigating the problem (but not completely solving it: on shorter contexts and smaller batches it's still memory-access-bound).
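A back-of-the-envelope sketch of why batching does not remove the cache traffic, and why a smaller cached form helps: weights are read once per step and shared by the whole batch, but each request brings its own cache. The model shape below is an assumed Llama-3-8B-like configuration, the latent width of 576 is a purely hypothetical MLA-style value, and all function names are made up for illustration.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of K and V cached per token for MHA/GQA-style attention."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def latent_bytes_per_token(n_layers, latent_dim, bytes_per_elem):
    """Bytes cached per token when one compressed latent is stored per layer."""
    return n_layers * latent_dim * bytes_per_elem


def decode_bytes_per_request(weight_bytes, batch, ctx_len, cache_bytes_per_token):
    """Bytes read per request per decode step: weights are shared, cache is not."""
    return weight_bytes / batch + ctx_len * cache_bytes_per_token


# Assumed Llama-3-8B-like shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
gqa = kv_bytes_per_token(32, 8, 128, 2)
# Hypothetical MLA-style latent of width 576 at the same depth (illustrative only).
mla = latent_bytes_per_token(32, 576, 2)

weights = 8e9  # roughly 8B parameters at 1 byte each (FP8), a rough assumption
for batch in (1, 8, 64):
    print(batch,
          decode_bytes_per_request(weights, batch, 8192, gqa),
          decode_bytes_per_request(weights, batch, 8192, mla))
```

As the batch grows, the weight term shrinks toward zero but the per-request cache term does not, so the bytes read per request hit a floor set by context length times cache size per token. A smaller cached form lowers that floor without eliminating it.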
