| HN Mirror



	by aurareturn 469 days ago
	CPUs typically do not have enough compute. If the model is large enough, you will be bottlenecked by compute before bandwidth. Time to first token, context length, and tokens/s are significantly worse on CPUs with larger models, even if the bandwidth is the same.
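
One rough way to check which limit applies is a roofline-style estimate. The sketch below is a simplification, not a benchmark: the function names are made up for illustration, it assumes about 2 FLOPs per parameter per token, and it ignores attention cost, the KV cache, and batching.

```python
def decode_tokens_per_s(params, bytes_per_weight, mem_bw_bytes_s, flops_per_s):
    """Upper bound on single-stream decode speed (hypothetical helper).

    Each generated token reads every weight once (bandwidth limit) and
    needs roughly 2 FLOPs per parameter (compute limit).
    """
    bandwidth_limit = mem_bw_bytes_s / (params * bytes_per_weight)
    compute_limit = flops_per_s / (2 * params)
    return min(bandwidth_limit, compute_limit)


def prefill_seconds(params, prompt_tokens, flops_per_s):
    """Lower bound on time to first token, assuming prefill is compute bound."""
    return 2 * params * prompt_tokens / flops_per_s


if __name__ == "__main__":
    # Placeholder figures, NOT measurements of any real CPU or GPU.
    params = 70e9            # 70B-parameter model
    bandwidth = 400e9        # 400 GB/s on both devices
    slow_device = 4e12       # 4 TFLOP/s
    fast_device = 80e12      # 80 TFLOP/s
    for name, flops in [("slow", slow_device), ("fast", fast_device)]:
        tps = decode_tokens_per_s(params, 1, bandwidth, flops)
        ttft = prefill_seconds(params, 2048, flops)
        print(f"{name}: {tps:.2f} tokens/s decode bound, {ttft:.1f} s prefill bound")
```

Under these assumptions, the compute term dominates time to first token, since prefill processes the whole prompt at once.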


adrian_b 469 days ago

A single big server CPU can have a computational capability similar to that of a mid-range desktop NVIDIA GPU.

When used for ML/AI applications, a consumer GPU has much better performance per dollar.

Nevertheless, when you need much more memory than a desktop GPU offers, a dual-socket server can have higher memory bandwidth than most desktop GPUs, i.e. more than an RTX 4090, and an FP32 computational capability that could exceed an RTX 4080. It would be slower for low-precision data, where the NVIDIA tensor cores can be used.

kiratp 469 days ago

Nobody is using FP32 for AI.

INT8, INT4, FP8 and soon FP4

adrian_b 468 days ago

True, but I compared the FP32 throughput used in graphics computations because that information is easily available.

Both CPUs (with the BF16 instructions and the VNNI instructions for INT8 inference) and GPUs have higher throughput for lower-precision data types than for FP32, but the exact acceleration factors are hard to find.
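
The compute speedups are hard to pin down, but one effect of narrower types is plain arithmetic: fewer bits per weight means less memory to hold and to stream per token. A minimal sketch, where the helper name is hypothetical and the per-block scale factors that real quantization formats add are ignored:

```python
# Nominal storage width of each data type, in bits per weight.
BITS_PER_WEIGHT = {
    "FP32": 32,
    "BF16": 16,
    "FP8": 8,
    "INT8": 8,
    "INT4": 4,
    "FP4": 4,
}


def weight_footprint_gb(params, dtype):
    """Size of the weights alone in GB (10^9 bytes), ignoring quantization metadata."""
    return params * BITS_PER_WEIGHT[dtype] / 8 / 1e9


if __name__ == "__main__":
    params = 7e9  # illustrative model size, not tied to any specific model
    for dtype in BITS_PER_WEIGHT:
        print(f"{dtype}: {weight_footprint_gb(params, dtype):.1f} GB")
```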

Intel server CPUs have an advantage over AMD in that they also have the AMX matrix instructions, which are intended to compete with the NVIDIA tensor cores for inference applications. However, Intel CPUs with enough cores to be competitive with GPUs are much more expensive.
