by vardump 3979 days ago
As a software dev who has used and benefited from AVX2, I think the last data point in the chart is quite a bit off. Perhaps their benchmarking software didn't support AVX2 (Haswell/Broadwell) yet?

> Blue depicts parallel performance, while purely sequential performance is shown in orange.

The sequential trend seems to totally disregard the up-to-doubled integer vector performance of AVX2, first introduced in Haswell. The original AVX didn't support 256-bit wide integer vectors. Haswell also offers up to "doubled" [1] FLOP/cycle.

[1]: If your workload is FMA. That said, it is a pretty common FPU workload. See for example http://stackoverflow.com/questions/15655835/flops-per-cycle-... for reference.
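
To make the "doubled integer vector" point concrete, here is a small sketch of the lane arithmetic. The helper name `lanes` is made up for illustration; the only inputs are the 128-bit integer width available before AVX2 and the 256-bit width AVX2 adds.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one vector register."""
    return register_bits // element_bits


# Integer vector ops are 128 bits wide with SSE and original AVX,
# and 256 bits wide with AVX2.
for element_bits in (8, 16, 32, 64):
    before = lanes(128, element_bits)
    after = lanes(256, element_bits)
    print(f"int{element_bits}: {before} -> {after} lanes per instruction")
```

Twice the lanes per instruction is an upper bound on the speedup; memory bandwidth and instruction mix decide how much of it a real workload sees.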

2 comments

The amount of floating-point arithmetic done on the CPU in modern games is less than you'd think. For graphics you are looking at no more than thousands of matmuls on the CPU, tens of thousands at worst. Games that do their physics on the CPU still struggle to saturate it even with SSE. In games a good amount of CPU time is spent moving bytes around, which is precisely why we have data-oriented programming.

Pre-DX12/pre-Vulkan games were CPU-bottlenecked because a hefty amount of CPU time went to the abstraction layer, likely little of which was FLOPs. Basically: you forgot to profile first ;). For a purely graphics workload the CPU does relatively little work.

In addition, I can say with fair certainty that only a small portion of the gamer market (so-called enthusiasts) upgrades their CPU on a regular basis. Taking myself as an example: I haven't upgraded my CPU since 2010 and I've only started feeling the pinch this year. That's a lot of CPUs without AVX2 support.

TL;DR: AVX2 is currently in the "solutions looking for a problem" bucket.

The chart seems to indicate about 25 GFLOPS for sequential performance, while the real value is up to 100 GFLOPS theoretical on a single Haswell/Broadwell core at 3.1 GHz.

Of course, realistic single-core performance won't approach 100 GFLOPS, but 25 is a pretty lowball value.
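
A rough sketch of where the "up to 100 GFLOPS" figure comes from. The breakdown is an assumption spelled out for illustration: single precision, two FMA units per core, 8 float lanes per 256-bit register, and each FMA counted as 2 FLOPs. The function name is made up.

```python
def peak_gflops(clock_ghz: float, fma_units: int, lanes: int) -> float:
    """Theoretical single-core peak, counting each fused multiply-add as 2 FLOPs."""
    flops_per_cycle = fma_units * lanes * 2
    return clock_ghz * flops_per_cycle


# Assumed Haswell/Broadwell figures: 2 FMA units, 256 / 32 = 8 single-precision lanes.
haswell_peak = peak_gflops(clock_ghz=3.1, fma_units=2, lanes=8)
print(f"{haswell_peak:.1f} GFLOPS")  # 32 FLOP/cycle at 3.1 GHz
```

This only holds when every cycle issues two independent FMAs on full-width registers, which is why it is a ceiling rather than an expectation.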

Is it possible that they're using "sequential" strictly, to mean that the arithmetic isn't vectorized? What's the scalar throughput like?

Scalar throughput would be way less than that 25 GFLOPS figure: at most 2x the clock frequency. It's likely their benchmark just doesn't support AVX2 (and FMA [1]).

You get about 25 GFLOPS if you use SSE only.

[1]: https://en.wikipedia.org/wiki/FMA_instruction_set
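
For comparison, the scalar and SSE-only numbers above can be sketched the same way. The model is an assumption: at most two floating-point instructions issued per cycle (say one add and one multiply), single precision, no FMA, and the same 3.1 GHz clock. The names here are illustrative only.

```python
CLOCK_GHZ = 3.1            # clock used in the thread's estimate
INSTRUCTIONS_PER_CYCLE = 2  # assumed: one add plus one multiply per cycle, no FMA


def gflops_without_fma(lanes: int) -> float:
    """Peak GFLOPS when each instruction does one FLOP per lane."""
    return CLOCK_GHZ * INSTRUCTIONS_PER_CYCLE * lanes


scalar = gflops_without_fma(lanes=1)    # one float per instruction
sse_only = gflops_without_fma(lanes=4)  # 128-bit register / 32-bit float
print(f"scalar: {scalar:.1f} GFLOPS, SSE only: {sse_only:.1f} GFLOPS")
```

Under these assumptions SSE lands right at the roughly 25 GFLOPS the chart shows, which fits the guess that the benchmark used neither AVX2 nor FMA.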