by jheriko 3513 days ago
i'm curious where this information comes from. a lot of it jars heavily against personal experience and measurements i can do for myself.

the idea that floating point division is that much more expensive than multiplication for instance... the only difference afaik is latency, not throughput.

the idea that an indirect call and a virtual function call are so close as well... when a virtual call is a read followed by an indirect call, whilst the timings given for some of the reads are considerably greater than either, is utter nonsense on inspection.

take with a great pinch of salt and remember the one correct way to judge timings is to measure them in context instead of guessing based on information that could well be wrong.

imo this kind of article is harmful and misses the more important lesson to learn: measure, measure, measure.
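As a rough illustration of what measuring could look like, here is a minimal C++ sketch that times a chain of dependent operations. The names `Timing` and `time_chain` are hypothetical, invented for this example, and it assumes a dependent chain is a fair proxy for the latency-bound case. It is not a substitute for profiling the real code in context.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Result of one timing run: the final value (returned so the compiler
// cannot discard the loop) and the average wall-clock time per operation.
struct Timing {
    double result;
    double nanoseconds_per_op;
};

// Hypothetical helper: applies op(x, operand) `iterations` times, feeding
// each result into the next call, so the loop forms one long dependency
// chain and the measurement is dominated by latency rather than throughput.
template <typename Op>
Timing time_chain(Op op, double seed, double operand, std::uint64_t iterations) {
    assert(iterations > 0);
    // Read the inputs through volatile so the compiler cannot constant-fold
    // the chain or turn a division by a known constant into a multiply.
    volatile double seed_in = seed;
    volatile double operand_in = operand;
    double x = seed_in;
    const double k = operand_in;

    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        x = op(x, k);
    }
    const auto stop = std::chrono::steady_clock::now();

    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {x, ns / static_cast<double>(iterations)};
}
```

Calling it once with `[](double x, double k) { return x * k; }` and once with `[](double x, double k) { return x / k; }`, using an operand close to 1.0 and a large iteration count, gives a rough per-operation figure for each. The numbers will vary with CPU, compiler flags and surrounding code, which is the point of measuring in context.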

2 comments

There is a "references" section at the bottom:

[Agner4] Agner Fog, “Instruction tables. Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs”

[Agner3] Agner Fog, “The microarchitecture of Intel, AMD and VIA CPUs. An optimization guide for assembly programmers and compiler makers”

[Intel.Skylake] “Intel® 64 and IA-32 Architectures Optimization Reference Manual”, 2-6, Intel

[Levinthal] David Levinthal, “Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors”, 22

[NoBugs] 'No Bugs' Hare, “C++ for Games: Performance. Allocations and Data Locality”

[AlBahra] Samy Al Bahra, “Nonblocking Algorithms and Scalable Multicore Programming”

[eruskin] http://assemblyrequired.crashworks.org/how-slow-are-virtual-...

[Agner1] Agner Fog, “Optimizing software in C++. An optimization guide for Windows, Linux and Mac platforms”

[Efficient C++] Dov Bulka, David Mayhew, “Efficient C++: Performance Programming Techniques”, p. 115

[Drepper] Ulrich Drepper, “Memory part 5: What programmers can do”, section 6.2.2

[TCMalloc] Sanjay Ghemawat, Paul Menage, “TCMalloc : Thread-Caching Malloc”

[Wikipedia.ProtectionRing] “Protection Ring”, Wikipedia

[Ongaro] Diego Ongaro, “The Cost of Exceptions of C++”

[LiEtAl] Chuanpeng Li, Chen Ding, Kai Shen, “Quantifying The Cost of Context Switch”

Multiplies are significantly cheaper than divisions in most recent processors.

First of all, latency is the most important parameter: after memory bandwidth, the latency of the longest dependency chain is typically the bottleneck, especially for floating point code.

For example, on Skylake float muls have a 4 cycle latency (same as adds and MADs) versus at least 14 cycles for divisions.

But even when only considering throughput, Skylake has two fully pipelined MAD units and can start 2 multiplies every clock cycle, while its single division unit is only partially pipelined and can start a new div only every fourth clock cycle (it is also, IIRC, only 128 bits wide, so 256-bit vector divs are more expensive still).

Avoiding divs (and mods) is still worth optimising for.
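One common way to avoid divs, sketched here in C++ under the assumption that the divisor is constant across a loop: compute the reciprocal once and multiply by it. The function names are hypothetical, made up for this illustration. Note that multiplying by a rounded reciprocal can differ from a true division in the last bit, which is why compilers will not make this substitution on their own without flags such as `-ffast-math`.

```cpp
#include <cassert>
#include <vector>

// Baseline: one floating point division per element.
void scale_with_div(std::vector<float>& values, float divisor) {
    assert(divisor != 0.0f);
    for (float& v : values) {
        v /= divisor;
    }
}

// Same intent with a single division up front and one multiply per element.
// The result may differ from scale_with_div by a rounding error, because
// 1.0f / divisor is itself rounded before it is used.
void scale_with_reciprocal(std::vector<float>& values, float divisor) {
    assert(divisor != 0.0f);
    const float reciprocal = 1.0f / divisor;
    for (float& v : values) {
        v *= reciprocal;
    }
}
```

Whether the second version is actually faster in a given program is something to confirm by measurement, since a memory-bound loop may hide the difference entirely.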