by ay
341 days ago
You will need to first sit and ballpark, then sit and benchmark, and then discover that your ballpark was probably wrong anyhow :-) Some pointers that I have found useful for both:

1. https://www.agner.org/optimize/instruction_tables.pdf - an extremely nice resource on the microarchitectural impact of instructions
2. https://llvm.org/docs/CommandGuide/llvm-mca.html - tooling from the LLVM project that lets you see some of these effects on real machine code
3. https://www.intel.com/content/www/us/en/developer/articles/t... - shows you whether the above matches reality (besides the CPU alone, more often than not your bottleneck is actually memory accesses, at least on the first access that wasn't triggered by a hardware prefetcher or a hint to it). On Linux, that would mean staring at "perf top" results.

So the answer is, as it very often is, "it depends".
|
1 - https://www.uops.info/index.html similar content to Agner's tables
2 - https://reflexive.space/zen2-ibs/ how to capture per-micro-op data on AMD CPUs (Zen 1 and later)
I agree on "it depends". And usually it depends not only on your actual code and data, but also on how you arrange the data across cache lines, on what other code on the same core/complex/system is doing to your view of the cache, and on other internal CPU features such as prefetchers or branch predictors.
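
To make the "arrangement matters" point concrete, here is a rough sketch of the kind of micro-benchmark you might start from. It does the same work (summing the same elements) in two visiting orders, sequential and shuffled, so the only thing that changes is the memory access pattern. All the names here (`best_time`, `sum_in_order`, `compare_access_patterns`) are made up for illustration, and the sizes are arbitrary assumptions. Interpreter overhead in Python hides much of the effect, so treat any gap you see as a hint to go look with a real profiler, not as a measurement of your cache.

```python
import random
import time


def best_time(fn, repeats=5):
    """Run fn several times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


def sum_in_order(data, order):
    """Sum data[i] for every i in order; only the visiting order varies."""
    total = 0
    for i in order:
        total += data[i]
    return total


def compare_access_patterns(n=1_000_000, seed=0):
    """Time the same sum with sequential and shuffled index orders.

    n and seed are arbitrary illustrative defaults, not tuned values.
    """
    data = list(range(n))
    sequential = list(range(n))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)  # same indices, scattered order

    t_seq, s_seq = best_time(lambda: sum_in_order(data, sequential))
    t_rnd, s_rnd = best_time(lambda: sum_in_order(data, shuffled))
    return {
        "sequential_s": t_seq,
        "shuffled_s": t_rnd,
        "sums_match": s_seq == s_rnd,  # sanity check: identical work
    }


if __name__ == "__main__":
    print(compare_access_patterns())
```

Taking the best of several runs is a deliberate choice here: it filters out some of the noise from whatever else is running on the same core, which is exactly the kind of interference mentioned above. The numbers will still differ from machine to machine and from run to run.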