by BeeOnRope 2307 days ago
The thing about needing cpuid isn't true, except perhaps on some older AMD hardware.

lfence works as an execution barrier and has an explicit cost of only a few cycles. You can accurately time a region with something like:

    lfence
    rdtsc
    lfence
    // timed region
    lfence
    rdtsc

This will give you accurate timing with some offset (i.e., even with an empty region you get a result on the order of 25-40 cycles), which you can mostly subtract out.
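
As a rough sketch of the subtract-the-offset idea, here is the same fence pattern in C using the GCC/Clang intrinsics `_mm_lfence()` and `__rdtsc()` from `<x86intrin.h>`. The helper names (`tsc_begin`, `tsc_end`, `tsc_empty_overhead`, `tsc_time`) are made up for illustration, and taking the minimum over many empty runs is just one plausible way to estimate the offset, not something the comment prescribes.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Start timestamp: lfence; rdtsc; lfence, as in the sequence above. */
static inline uint64_t tsc_begin(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* End timestamp: lfence; rdtsc. */
static inline uint64_t tsc_end(void) {
    _mm_lfence();
    return __rdtsc();
}

/* Hypothetical helper: smallest delta seen for an empty timed region.
 * The minimum is used so that interrupts and cache misses, which only
 * ever add time, do not inflate the estimate. */
uint64_t tsc_empty_overhead(int samples) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t t0 = tsc_begin();
        uint64_t t1 = tsc_end();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}

/* Hypothetical helper: time fn(arg) and subtract the measured offset.
 * The indirect call adds a little overhead of its own; inline the
 * region by hand if that matters for very short measurements. */
uint64_t tsc_time(void (*fn)(void *), void *arg, uint64_t overhead) {
    uint64_t t0 = tsc_begin();
    fn(arg);
    uint64_t t1 = tsc_end();
    uint64_t delta = t1 - t0;
    return delta > overhead ? delta - overhead : 0;
}
```

Call `tsc_empty_overhead(1000)` once up front, then pass the result to `tsc_time`. The values are TSC ticks, so converting to nanoseconds needs the TSC frequency, which this sketch does not try to determine.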

Done carefully, you can get results accurate to a nanosecond or so.

rdtscp has few advantages over lfence + rdtsc, and arguably some disadvantages (you can't control where the implied fence goes).

1 comment

Specifically, the Intel manual makes the following important points, one involving an `mfence;lfence` combo:

* If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible, it can execute LFENCE immediately before RDTSC.

* If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.

* If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC.

The manual also notes that the RDTSC instruction itself was introduced by the Pentium processor.
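
For reference, the three fence placements quoted above map onto intrinsics roughly as follows. This is only a sketch: the function names are invented here, and `_mm_lfence()`, `_mm_mfence()` and `__rdtsc()` are the GCC/Clang intrinsics from `<x86intrin.h>`.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence, _mm_mfence (x86 only) */

/* First bullet: all previous instructions executed and previous loads
 * globally visible before the timestamp is read. */
static inline uint64_t rdtsc_after_loads(void) {
    _mm_lfence();
    return __rdtsc();
}

/* Second bullet: previous loads and stores globally visible before
 * the timestamp is read (the mfence;lfence combo). */
static inline uint64_t rdtsc_after_loads_and_stores(void) {
    _mm_mfence();
    _mm_lfence();
    return __rdtsc();
}

/* Third bullet: the timestamp is read before any subsequent
 * instruction (including memory accesses) executes. */
static inline uint64_t rdtsc_before_later(void) {
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
```

The lfence + rdtsc + lfence start sequence from the parent comment is the first and third variants combined.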

rdtscp is usually a bit more disruptive, and cpuid is probably 100 or 1000 times more disruptive.