by aurareturn
469 days ago
CPUs typically don't have enough compute. If the model is large enough, you'll be compute-bottlenecked before you're bandwidth-bottlenecked. Time to first token, usable context length, and tokens/s are all significantly worse on CPUs with larger models, even when the memory bandwidth is the same.
When used for ML/AI applications, a consumer GPU has much better performance per dollar.
Nevertheless, when you need much more memory than a desktop GPU offers, a dual-socket server can have higher memory bandwidth than most desktop GPUs, i.e. more than even an RTX 4090, and FP32 compute that could exceed an RTX 4080. It would still be slower for low-precision data, where the NVIDIA tensor cores can be used.
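
A rough way to sanity-check the compute-vs-bandwidth argument in this thread is a back-of-the-envelope roofline estimate. The sketch below is only that: it assumes a dense model that reads every weight once per forward pass and costs about 2 FLOPs per parameter per token, and it ignores KV cache traffic, caches, and real-world efficiency. The function names and every number in the example are hypothetical placeholders, not measurements of any actual CPU or GPU.

```python
def decode_tokens_per_s(model_bytes, mem_bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when bandwidth-bound.

    Assumption: each generated token streams all weights from memory once.
    """
    return mem_bandwidth_bytes_per_s / model_bytes


def prefill_seconds(n_params, prompt_tokens, flops_per_s):
    """Lower bound on time to first token when compute-bound.

    Assumption: ~2 FLOPs per parameter per prompt token (dense forward pass).
    """
    return 2 * n_params * prompt_tokens / flops_per_s


def bottleneck(n_params, bytes_per_param, batch_tokens, flops_per_s,
               mem_bandwidth_bytes_per_s):
    """Say which limit dominates one forward pass over `batch_tokens` tokens."""
    compute_time = 2 * n_params * batch_tokens / flops_per_s
    memory_time = n_params * bytes_per_param / mem_bandwidth_bytes_per_s
    return "compute" if compute_time > memory_time else "bandwidth"


if __name__ == "__main__":
    # Hypothetical placeholders: a 70B-parameter model at 1 byte per weight,
    # and two machines with the SAME bandwidth but different compute.
    n_params = 70e9
    bandwidth = 500e9        # bytes/s, assumed equal on both machines
    low_compute = 5e12       # FLOP/s, stand-in for a CPU-like machine
    high_compute = 100e12    # FLOP/s, stand-in for a GPU-like machine

    print("decode bound (tokens/s):", decode_tokens_per_s(n_params * 1, bandwidth))
    print("prefill, low compute (s):", prefill_seconds(n_params, 2000, low_compute))
    print("prefill, high compute (s):", prefill_seconds(n_params, 2000, high_compute))
    print("1-token step:", bottleneck(n_params, 1, 1, low_compute, bandwidth))
    print("2000-token prefill:", bottleneck(n_params, 1, 2000, low_compute, bandwidth))
```

Under these assumptions, equal bandwidth gives the same decode ceiling on both machines, but the prompt-processing time scales inversely with compute, which is why time to first token and long contexts suffer on the lower-compute machine.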