> Most of those FLOPS are constrained by memory bandwidth
I believe inference with large enough batch size is almost always compute bound, simply due to algorithmic complexity.
Each step of tiled matric multiplication with square tiles of size N^2 takes O(N^2) memory loads and O(N^3) compute operations. With N = 32 or 64, you will likely saturate compute even on iGPUs with DDR4 or DDR5 memory pretending to be VRAM.
That's also a CPU that came out four years later than the A100. The contemporaneous B200 is not optimized for FP32 and does 74.45 TFLOP/s. For FP16 it's at ~2 PFLOP/s.
but it is very impressive how far modern CPUs get as well (also in smart phones!)