

by Const-me 26 days ago

> Most of those FLOPS are constrained by memory bandwidth

I believe inference with a large enough batch size is almost always compute bound, simply due to algorithmic complexity.

Each step of tiled matrix multiplication with square N×N tiles takes O(N^2) memory loads and O(N^3) compute operations, so the ratio of compute to memory traffic grows as O(N). With N = 32 or 64, you will likely saturate compute even on iGPUs with DDR4 or DDR5 memory pretending to be VRAM.
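To make the O(N^2) vs O(N^3) argument concrete, here is a rough back-of-the-envelope sketch. The function names, the FP16 default, and the counting convention (one tile of each input loaded per step, a multiply-add counted as 2 FLOPs) are my assumptions for illustration; real kernels also benefit from caches and register reuse, which this ignores.

```python
def tile_step_intensity(n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte loaded for one step of a tiled matmul with n x n tiles.

    Assumes each step loads one n x n tile of A and one of B from memory
    and performs n^3 multiply-adds (counted as 2 FLOPs each).
    """
    flops = 2 * n**3
    bytes_loaded = 2 * n**2 * bytes_per_element
    return flops / bytes_loaded


def is_compute_bound(n: int, peak_flops: float, bandwidth_bytes: float,
                     bytes_per_element: int = 2) -> bool:
    """Compare tile intensity with the machine balance (peak FLOP/s per byte/s).

    peak_flops and bandwidth_bytes are whatever the hardware in question
    delivers; no specific device is implied here.
    """
    machine_balance = peak_flops / bandwidth_bytes
    return tile_step_intensity(n, bytes_per_element) >= machine_balance
```

Under these assumptions the intensity simplifies to N / bytes_per_element, so doubling the tile size doubles the FLOPs available per byte of memory traffic. Whether a given N is enough depends on the compute-to-bandwidth ratio of the actual hardware.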


zzzoom 26 days ago

Prefill (GEMM) is compute bound, decode (GEMV) is memory bound.


Const-me 26 days ago

> decode (GEMV) is memory bound

Decode with batch size 1 is GEMV. Batching turns decode into GEMM too.
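A rough sketch of why batching changes the picture, for a single weight matrix applied to a batch of token vectors. The function name, the FP16 default, and the 4096 x 4096 layer size in the usage note are hypothetical, and the model deliberately ignores attention and the KV cache, which scale differently with batch size.

```python
def decode_intensity(batch: int, d_in: int, d_out: int,
                     bytes_per_element: int = 2) -> float:
    """FLOPs per byte moved for one (d_out x d_in) layer during decode.

    Assumes the weights are read once per step and shared by every
    sequence in the batch, while activations are read and written
    once per sequence.
    """
    flops = 2 * batch * d_in * d_out              # multiply-add = 2 FLOPs
    weight_bytes = d_in * d_out * bytes_per_element
    activation_bytes = batch * (d_in + d_out) * bytes_per_element
    return flops / (weight_bytes + activation_bytes)


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 layer at increasing batch sizes.
    for batch in (1, 8, 64):
        print(batch, round(decode_intensity(batch, 4096, 4096), 2))
```

With batch size 1 the intensity sits just under 1 FLOP per byte in FP16, which is the GEMV case. As the batch grows, the same weight bytes are amortized over more work, so the intensity rises roughly in proportion to the batch size until activation traffic starts to matter.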
