Hacker News new | ask | show | jobs
by mzaks 3679 days ago
I Updated the run bench function

https://gist.github.com/mzaks/e3a2dc7ccdfc2397bc26c55eb6dc8a...

the output is now:

  Eager run
  =================================
  1557 ms encode
  264 ms decode
  34 ms use
  206 ms dealloc
  504 ms decode+use+dealloc
  0,38 ms direct
  0,32 ms using struct
  =================================
  Total counter1 is 8644311667000000
  Total counter2 is 8644311667000000
  Total counter3 is 8644311667000000
  Encoded size is 315 bytes, should be 344 if not using unique strings
  =================================

As you can see all three counters are equal.
3 comments

The function call is not being optimized out, it's being hoisted outside the loop. I.e., it is as if the code was written as:

    let result = flatuseStruct(outputData)
    for _ in 0..<iterations {
        assert(result == 8644311667)
        total = total + UInt64(result)
    }
The counter will still be correct, but you are not measuring what you think you are measuring.
This makes sense!

Changed the iteration to:

   for i in 0..<iterations {
     let result = flatuseStruct(outputData, start:i)
     assert(result == 8644311666 + Int(i))
     total2 = total2 + UInt64(result)
   }
now result is around 43ms

Thanks for pointing it out. Have to check if there are so other things involved, but that might be it.

That's 0.35 nanoseconds to do something.

If it is a nop on a modern cpu it can run x4 nops at the same time. Either you aren't doing useful work or your measurements are wrong.

Unfortunately I can't view gists in work.

You can see the problem here:

First it cals CFAbsoluteTimeGetCurrent and saves the result.

    0x100272dab <+10443>: callq  0x1002b7b38 ; CFAbsoluteTimeGetCurrent
    0x100272db0 <+10448>: movapd %xmm0, -0xa0(%rbp)
Here is the call to flatDecodeDirect. I guess RDI is the input. That's usual for the x64 ABI.

    0x100272db8 <+10456>: movq   -0x100(%rbp), %rdi
    0x100272dbf <+10463>: callq  0x10026fb10 ; flatDecodeDirect
I don't know what this next bit is for.

    0x100272dc4 <+10468>: testq  %rax, %rax
    0x100272dc7 <+10471>: js     0x10027444c ; <+16236> [inlined] generic specialization <FlatBuffersPerformanceTestDesktop.FlatBufferReader> of Swift._ContiguousArrayBuffer._checkValidSubscript (Swift.Int) -> ()
0x3e8=1000. The loop counter is in ECX.

    0x100272dcd <+10477>: movl   $0x3e8, %ecx
I can't figure out what's at these two addresses; lldb didn't seem to accept any reasonable syntax. lldb is terrible. But I'll bet that RBX is holding the value of `total'. I don't know what r14 is, and it doesn't seem to matter since nothing here uses it.

    0x100272dd2 <+10482>: movq   -0xd8(%rbp), %rbx
    0x100272dd9 <+10489>: movq   -0x198(%rbp), %r14
Here's the loop. The loop is unrolled 5 times. total+=result. 0x100274454 produces some kind of exception on integer overflow.

    0x100272de0 <+10496>: addq   %rax, %rbx
    0x100272de3 <+10499>: jb     0x100274454 ; at flatbench.swift:284
    0x100272de9 <+10505>: addq   %rax, %rbx
    0x100272dec <+10508>: jb     0x100274454
    0x100272df2 <+10514>: addq   %rax, %rbx
    0x100272df5 <+10517>: jb     0x100274454
    0x100272dfb <+10523>: addq   %rax, %rbx
    0x100272dfe <+10526>: jb     0x100274454
    0x100272e04 <+10532>: addq   %rax, %rbx
    0x100272e07 <+10535>: jb     0x100274454
The loop was unrolled 5 times, so drop 5 from the loop counter and repeat.

    0x100272e0d <+10541>: addq   $-0x5, %rcx
    0x100272e11 <+10545>: jne    0x100272de0 ; at flatbench.swift:276
Get current time.

    0x100272e13 <+10547>: callq  0x1002b7b38 ; CFAbsoluteTimeGetCurrent
So this code actually times one call to flatDecodeDirect, then 200 iterations of an unrolled do-nothing loop. The compiler has figured out somehow that flatDecodeDirect is going to do exactly the same thing each time, and taken advantage of that by calling it only once. I'm guessing this means that flatDecodeDirect is only called 1,000 times in total.

As a sanity check for this kind of thing - try making a little loop that just increments an integer the appropriate number of times, and see how long that takes. (Check the assembly language output to ensure the generated code is doing what you think - it should be a 2-instruction loop.)

On my laptop that takes 1.8ms. This isn't the absolute limit of how long it takes to do 1,000,000 of anything, but it'll do as a rough estimate. So you should be suspicious if a program suggests it's taking much less time than that to do 1,000,000 of something that's a lot more complicated, as the test did. (It reported 1,000,000 iterations in 0.53ms on my PC.)

(Of course, as with any rough estimate, this only gives you a suspicion, and isn't proof without further investigation.)

I was suspicious, I just could not put my finger on it :)

   for i in 0..<iterations {
     let result = flatuseStruct(outputData, start:i)
     assert(result == 8644311666 + Int(i))
     total2 = total2 + UInt64(result)
   }
this results in around 42ms compared to C 25ms. I guess I should update my blog post :)

Thanks for your help.