by namibj 1926 days ago
For the interested, LLVM-MCA says this

    Iterations:        10000
    Instructions:      100000
    Total Cycles:      25011
    Total uOps:        100000

    Dispatch Width:    4
    uOps Per Cycle:    4.00
    IPC:               4.00
    Block RThroughput: 2.5

    No resource or data dependency bottlenecks discovered.
To me that reads as 2.5 cycles per iteration on Zen 3: 10 instructions per iteration at a dispatch width of 4. Tiger Lake is a bit worse, at about 3 cycles per iteration, apparently because it runs more uOps per iteration.
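For anyone who wants to check the arithmetic, here is a minimal sketch that re-derives the per-iteration figure from the numbers in the report. The variable names are mine; the values are copied from the output above, and the dispatch bound assumes one uOp per instruction, which is what the report shows for Zen 3.

```python
# Figures copied from the llvm-mca summary above (Zen 3).
iterations = 10_000
instructions = 100_000
total_cycles = 25_011
dispatch_width = 4

# Instructions in one pass through the loop body.
instr_per_iter = instructions // iterations

# Measured cost per iteration, including the few cycles of pipeline ramp-up.
cycles_per_iter = total_cycles / iterations

# Lower bound if dispatch width is the only limit (one uOp per instruction).
dispatch_bound = instr_per_iter / dispatch_width

# Instructions per cycle, as the report rounds it.
ipc = instructions / total_cycles

print(f"{instr_per_iter} instructions/iteration")
print(f"{cycles_per_iter:.4f} cycles/iteration (dispatch bound {dispatch_bound})")
print(f"IPC {ipc:.2f}")
```

The measured figure sits just above the dispatch bound, which fits the report finding no resource or data dependency bottleneck.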

That is for the following loop body (extracted from the output of `clang -O3 -march=znver3`, built from trunk at 5a8d5a2859d9bb056083b343588a2d87622e76a2):

    .LBB5_2:                                # =>This Inner Loop Header: Depth=1
    mov     rdx, r11
    add     r11, r8
    mulx    rdx, rax, r9
    xor     rdx, rax
    mulx    rdx, rax, r10
    xor     rdx, rax
    mov     qword ptr [rdi + 8*rcx], rdx
    add     rcx, 2
    cmp     rcx, rsi
    jb      .LBB5_2
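
If the assembly is hard to follow, this is a rough Python model of my reading of it, not the original source. The names (`state`, `inc`, `k1`, `k2`, `buf`, `start`, `limit`) are hypothetical stand-ins for r11, r8, r9, r10, rdi, rcx and rsi. Each `mulx`/`xor` pair does a full 64x64-bit multiply and folds the high half into the low half.

```python
MASK64 = (1 << 64) - 1

def mul_fold(a, b):
    """Model of `mulx rdx, rax, src` followed by `xor rdx, rax`:
    128-bit product of two 64-bit values, high half XOR low half."""
    product = (a & MASK64) * (b & MASK64)
    return (product >> 64) ^ (product & MASK64)

def run_loop(buf, start, limit, state, inc, k1, k2):
    """Model of the loop at .LBB5_2. Returns the final state (r11)."""
    i = start
    while True:
        x = state                        # mov  rdx, r11
        state = (state + inc) & MASK64   # add  r11, r8
        x = mul_fold(x, k1)              # mulx rdx, rax, r9  / xor rdx, rax
        x = mul_fold(x, k2)              # mulx rdx, rax, r10 / xor rdx, rax
        buf[i] = x                       # mov  qword ptr [rdi + 8*rcx], rdx
        i += 2                           # add  rcx, 2
        if not i < limit:                # cmp  rcx, rsi / jb .LBB5_2
            break
    return state

# Tiny made-up inputs, just to exercise the model.
buf = [0] * 4
final_state = run_loop(buf, start=0, limit=4, state=1, inc=1, k1=2, k2=3)
print(buf, final_state)
```

Note that the check sits at the bottom of the loop, so the body always runs at least once, and the store only touches every second 8-byte slot because the index advances by 2.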