| You can see the problem here. First it calls CFAbsoluteTimeGetCurrent and saves the result.
0x100272dab <+10443>: callq 0x1002b7b38 ; CFAbsoluteTimeGetCurrent
0x100272db0 <+10448>: movapd %xmm0, -0xa0(%rbp)
Here is the call to flatDecodeDirect. I guess RDI is the input; it's the usual first-argument register in the x86-64 System V ABI.
0x100272db8 <+10456>: movq -0x100(%rbp), %rdi
0x100272dbf <+10463>: callq 0x10026fb10 ; flatDecodeDirect
I don't know what this next bit is for.
0x100272dc4 <+10468>: testq %rax, %rax
0x100272dc7 <+10471>: js 0x10027444c ; <+16236> [inlined] generic specialization <FlatBuffersPerformanceTestDesktop.FlatBufferReader> of Swift._ContiguousArrayBuffer._checkValidSubscript (Swift.Int) -> ()
0x3e8 = 1000. The loop counter is in ECX.
0x100272dcd <+10477>: movl $0x3e8, %ecx
I can't figure out what's at these two addresses; lldb didn't seem to accept any reasonable syntax. lldb is terrible. But I'll bet that RBX is holding the value of `total`. I don't know what R14 is, and it doesn't seem to matter since nothing here uses it.
0x100272dd2 <+10482>: movq -0xd8(%rbp), %rbx
0x100272dd9 <+10489>: movq -0x198(%rbp), %r14
Here's the loop, unrolled 5 times. Each addq is total += result, and each jb branches to 0x100274454, which produces some kind of exception on integer overflow.
0x100272de0 <+10496>: addq %rax, %rbx
0x100272de3 <+10499>: jb 0x100274454 ; at flatbench.swift:284
0x100272de9 <+10505>: addq %rax, %rbx
0x100272dec <+10508>: jb 0x100274454
0x100272df2 <+10514>: addq %rax, %rbx
0x100272df5 <+10517>: jb 0x100274454
0x100272dfb <+10523>: addq %rax, %rbx
0x100272dfe <+10526>: jb 0x100274454
0x100272e04 <+10532>: addq %rax, %rbx
0x100272e07 <+10535>: jb 0x100274454
The loop was unrolled 5 times, so drop 5 from the loop counter and repeat.
0x100272e0d <+10541>: addq $-0x5, %rcx
0x100272e11 <+10545>: jne 0x100272de0 ; at flatbench.swift:276
Get the current time.
0x100272e13 <+10547>: callq 0x1002b7b38 ; CFAbsoluteTimeGetCurrent
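In Swift terms, the listing above behaves roughly like the second function in this sketch rather than the first. Only the name flatDecodeDirect comes from the disassembly; its signature, its body, and the names asWritten and asCompiled are assumptions made up for illustration, not the actual benchmark source.

```swift
import Foundation

// Hypothetical stand-in for the real decoder. Only the name is taken from
// the listing; the parameter type and return value are assumptions.
func flatDecodeDirect(_ input: [UInt8]) -> Int {
    return input.count
}

// What the benchmark presumably intends: one decode per iteration.
func asWritten(_ input: [UInt8]) -> Int {
    var total = 0
    for _ in 0..<1000 {
        total += flatDecodeDirect(input)
    }
    return total
}

// What the optimized code effectively does: a single decode, then
// 1000 overflow-checked additions of the same result.
func asCompiled(_ input: [UInt8]) -> Int {
    var total = 0
    let result = flatDecodeDirect(input) // call hoisted out of the loop
    for _ in 0..<1000 {
        total += result // += traps on overflow, like the jb branches
    }
    return total
}
```

Both functions return the same total, which is why the compiler is allowed to make the swap if it can show the call has no other effects.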
So this code actually times one call to flatDecodeDirect, then 200 iterations of an unrolled do-nothing loop. The compiler has figured out somehow that flatDecodeDirect is going to do exactly the same thing each time, and taken advantage of that by calling it only once. I'm guessing this means that flatDecodeDirect is only called 1,000 times in total, not the 1,000,000 times the test reports.

As a sanity check for this kind of thing, try making a little loop that just increments an integer the appropriate number of times, and see how long that takes. (Check the assembly language output to ensure the generated code is doing what you think; it should be a 2-instruction loop.) On my laptop that takes 1.8ms. This isn't the absolute limit of how long it takes to do 1,000,000 of anything, but it'll do as a rough estimate.

So you should be suspicious if a program suggests it's taking much less time than that to do 1,000,000 of something that's a lot more complicated, as the test did. (It reported 1,000,000 iterations in 0.53ms on my PC.) Of course, as with any rough estimate, this only gives you a suspicion, and isn't proof without further investigation. |
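A minimal sketch of that sanity check, assuming the same CFAbsoluteTimeGetCurrent timing the benchmark uses. The helper name timeEmptyLoop is made up for illustration. An optimizing build may well fold this loop into a constant, so the timing only means something once the disassembly confirms a real loop was generated.

```swift
import Foundation

// Hypothetical helper: count up to `iterations` one step at a time and
// report the elapsed wall-clock time in milliseconds.
func timeEmptyLoop(iterations: Int) -> (count: Int, milliseconds: Double) {
    let start = CFAbsoluteTimeGetCurrent()
    var count = 0
    for _ in 0..<iterations {
        count &+= 1 // wrapping add, so no overflow check in the loop body
    }
    let elapsed = CFAbsoluteTimeGetCurrent() - start
    return (count, elapsed * 1000.0)
}

let baseline = timeEmptyLoop(iterations: 1_000_000)
print("\(baseline.count) increments in \(baseline.milliseconds) ms")
```

If the benchmark reports a time well below this baseline for the same iteration count, that is a hint to go and read the generated code.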
Thanks for your help.