| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by xoranth 1073 days ago

I think I managed to improve on both this post, and its sequel, at the cost of specializing the function for the case of a string made only of 's' and 'p'.

The benchmark only tests strings made of 's' and 'p', so I think it is fair.

The idea is as follow. We want to increase `res` by one when the next character is `s`. Naively, we might try something like this:

    res += (c - 'r');  // is `res += 1` when c == 's'

This doesn't work, as `'p' - 'r' == -2`, and we'd need it to be -1.

But `'p' - 'r'`, when viewer as an unsigned integer, underflows, setting the carry flag. Turns out x64 has an instruction (adc) that adds two registers _plus_ the carry flag.

Therefore we can replace two `cmp, cmov` with one `sub, adc`:

    run_switches:
            xor    eax, eax            # res = 0
    loop:
            movsx  ecx, byte ptr [rdi]
            test   ecx, ecx
            je     ret
            inc    rdi
            sub    ecx, 'r'
            adc    eax, ecx     # Magic happens here
            jmp    loop
    ret:
            ret

Benchmarks are as follows (`bench-x64-8` is the asm above):

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.08 ± 0.00 times faster than '02-the-same-speed-as-c/bench-c-4-clang 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

Of course, one could improve things further using SWAR/SIMD...

2 comments

414owen 1073 days ago

Very interesting approach. I should probably have specified that the somewhat naive assembly in `02-the-same-speed-as-c/loop-5.x64.s` is the fastest version I have.

On my machine I'm getting 0.244s for `loop-5.x64.s` and 0.422s for your implementation above.

I'm not sure why exactly we're seeing this discrepancy, and for what it's worth your implementation looks faster to me. I guess this is why you need to always benchmark on the hardware you're going to be running the code on...

link

xoranth 1073 days ago

I rerun the benchmark vs loop-5 and loop-7 from the second post. Runtime is basically the same on my machine.

I would have expected yours to be faster given that it needs to execute fewer instructions per loop iteration. Though maybe the CPU can run `adc` on more ports compared to a load from memory?

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.00 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-7 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.01 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-5 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

link

gpderetta 1072 days ago

Even simpler: just sum all elements of the array. Then at the end subtract 'p'*len from the sum, then divide by ('s'-'p') to get the s count. The 'p' count is len minus the 's' count.

The initial sum is easily vectorized as well.

If I've not made any mistakes it should work. Only issue is possible overflow on the running sum.

Can't be bothered to benchmark it though:).

edit: missed the decrement when you see 's'. So the final result is p_count - s_count.

link

xoranth 1072 days ago

That's likely the fastest way to do that without vectorization. But you'd need to upcast 's' to an uint64 (or at least an uint32). That means that vectorization would operate on 32/64 bit lanes.

With vectorization, I think the way to go is to have two nested loops, an outer advances by 32 * 255 elements at a time, and an inner one that loads 32 bytes, compares each character to 's', and accumulates on 8 bit lanes.

Then in the outer loop you do an horizontal sum of the 8 bit accumulators.

link

fuber2018 1072 days ago

My SWAR version almost does what your vectorization algorithm description does - just that the SWAR-code looks rather gnarly because the compiler isn't auto-generating the vector code for you, it's hand-coded in C by me and I'm limited to 64 bits at a time.

link

gpderetta 1072 days ago

Indeed, the blocked vectorization with 8 bits accumulators shown elsethread is going to be faster and there reducing the sum to 1 bit per iteration is worth it.

link

fuber2018 1072 days ago

I took the 64-bit SWAR ('S'IMD-'W'ithin-'A'-'R'egister) road and passed in the string length - the calling code has the length "right there"!!!

Using the original run_switches function, app took 3.554s (average of 10 runs).

With the SWAR-version with the string length passed in, app took 0.117s (average of 10 runs).

That's an overall 27.6x speedup.

link

fuber2018 1072 days ago

If I unroll the main while loop to handle 4x as much each time through the loop in the SWAR-version, the runtime drops to 0.0562s (average 10 runs).

That's an overall 57.5x speedup.

link

fuber2018 1072 days ago

If I convert the unrolled-64-bit SWAR function to use 32-bit chunks instead, average runtime almost doubles, approx. 0.1s now.

Need sleep now.

link

fuber2018 1072 days ago

If I unroll the 64-bit SWAR version by 8x instead of 4x, the runtime is reduced by another 10% over the 4x-unrolled SWAR version. Diminishing returns...

link