Hacker News new | ask | show | jobs
by kristjansson 708 days ago
FlashAttention's algorithmic improvements is mostly just splitting/combining the softmax part of attention, and is itself not totally novel. The overwhelming contribution is implementing that, and all its fiddly pieces, efficiently on Nvidia hardware.
2 comments

To clarify further, flash attention is explicitly targeting a compute engine with separate MMA and "scalar" vector execution units that allow post-processing the MMA outputs without involving memory bandwidth (though arithmetic intensity, especially relative between the MMA and the "scalar" instructions, is of concern), with a substantial amount of manually-managed L1D$ to use as sub-matrix accumulator, and a linear-in-context-length amount of "VRAM" that requires sensible arithmetic intensity to avoid being a bandwidth bottleneck (iirc in the hundreds when counting the scalar multiplies hiding in the MMA instructions).

This v3 with async might for once be so tied to Hopper that it's not trivially portable to another platform that has the mentioned hardware blocks (AFAIK every AMD GCN card that can do compute shaders would qualify, though they do lack a specialized MMA unit).

Clarifying:

Given the question: "How much is the flash attention algorithm tied to the hardware?"

The answer is 0.

ex. you can find generic flash attention recently added in llama.cpp and ONNX (MS needed it for Phi-3, needed for Recall).

On the side, novelty, I have no direct knowledge on, IMHO, asking that question would devolve the way novelty arguments do in any field: there's always someone else who can claim they did 80% of $X via $X-1, therefore, $X is by and large not novel. Ad infinitum.

I think the right analogy for FA is high-quality cache-aware BLAS kernel implementations. The algorithm(s) is (are) clever and (as you note) completely independent of hardware. However, a hardware-naive implementation is approximately worthless. Most of the value of MKL, or Accelerate, or FA is in the careful matching of the parameters and implementation of the algorithm to the capabilities of hardware it's going run on.

I definitely don't mean to take away from Tri/FA by mentioning novelty - I'm just repeating from paper, which refers back to algebraic aggregates[0] in its discussion of their tiled softmax.

[0]: https://web.stanford.edu/class/cs345d-01/rl/olap.pdf

> However, a hardware-naive implementation is approximately worthless.

This isn’t true when there is one vendor that’s 90% of the market and 2 maybe 3 generations of hardware to consider. Support A100, H100 and you are supporting most of the current market.

Supporting A100 and H100 is the opposite of being hardware naive, though.