Hacker News new | ask | show | jobs
by eklitzke 1082 days ago
Rearranging branches (and perhaps blocks too?) will definitely be done if you are building using FDO, because without FDO (or PGO) the compiler has no idea how likely each branch is to be taken. Cmov can also be enabled by FDO in some cases.

However, whether or not using cmov is effective compared to a regular test/jump is highly dependent on how predictable the branch is, with cmov typically performing better when the branch is very unpredictable. Since they got a 6x speedup with cmov, I assume that their test input (which isn't described in the post, and is also not in their GitHub repo) consists of random strings consisting almost entirely of s and p characters. There's nothing wrong with this, but it does make the post seem a little misleading to me, as their clever speedup is mostly about exploiting an unmentioned property of the data that is highly specific to their benchmark.

2 comments

> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo) consists of random strings consisting almost entirely of s and p characters.

test code is here: https://github.com/414owen/blog-code/blob/master/02-the-same... it randomly selects between 's' or 'p'. The characters can't be anything other than 's', 'p', or the terminating null. Knowing that particular fact about our input gives us this ...clever... optimization:

    int run_switches(const char* s) {
        int result = 0;
        while (*s)
            result += (1 | *s++) - 'r';
        return result;
    }
which compiles to:

    run_switches:
            movzx   eax, BYTE PTR [rdi]
            xor     edx, edx
            test    al, al
            je      .L1
    .L3:
            or      eax, 1
            inc     rdi
            movsx   eax, al
            lea     edx, [rdx-114+rax]
            movzx   eax, BYTE PTR [rdi]
            test    al, al
            jne     .L3
    .L1:
            mov     eax, edx
            ret
This is too clever by half, of course, but it perfectly illustrates your point about exploiting properties of the data.
You do not even need to subtract 'r'.

    int run_switches(const char* s) {
        int s_count = 0;
        const char *begin = s;
        while(*s) {
            s_count += (1 & *s++);
        }
        int count = s-begin;
        return count - s_count;
    }
which compiles to:

    .L49:
        and     edx, 1
        add     rax, 1
        add     ecx, edx
        movzx   edx, BYTE PTR [rax]
        test    dl, dl
        jne     .L49
edit: other variant

    int run_switches2(const char* s) {
        const char *begin = s;
        int sum = 0;
        while(*s) {
            sum += *s++;
        }
        int count = s-begin;
        int s_count = sum - ('s'*count)/('p'-'s');
        int p_count = count - s_count;
        return p_count - s_count;
   }
which compiles to:

   run_switches2(char const*):
        movsx   eax, BYTE PTR [rdi]
        test    al, al
        je      .L56
        mov     rdx, rdi
        xor     ecx, ecx
   .L55:
        add     rdx, 1
        add     ecx, eax
        movsx   eax, BYTE PTR [rdx]
        test    al, al
        jne     .L55
        sub     rdx, rdi
        imul    esi, edx, 115
        movsx   rax, esi
        sar     esi, 31
        imul    rax, rax, 1431655766
        shr     rax, 32
        sub     eax, esi
        add     ecx, eax
        sub     edx, ecx
        mov     eax, edx
        sub     eax, ecx
        ret
   .L56:
        xor     eax, eax
        ret
None of these will beat the clever blocked SIMD someone showed elsethread.
> because without FDO (or PGO) the compiler has no idea how likely each branch is to be taken

So, the maximum amount of times you can hit '\0' is once in the string, because then the function returns, but you can hit the other characters many times, which seems to be information a compiler has access to without PGO.

PGO does help, of course, and on my machine gives me 2.80s, which is better than the code at the end of the `Rearranging blocks` section :)

> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo)

It's described under `Benchmarking setup`, and is in the repository here: https://github.com/414owen/blog-code/blob/master/01-six-time...

Side note: There's a part two to this post (linked at the bottom) where I make the C code as fast as I possibly can, and it beats all the assembly in this post.

I never said writing assembly is (necessarily) a good idea, I just find optimizing it, and deciphering compiler output, an interesting challenge, and a good learning opportunity.

Imagine a scenario where most of the strings being processed contain a single null character, with no other characters. In that case checking for the null character first would be optimal.

Does the compiler know that this isn't true? No, it doesn't. The author of the article is making an assumption about the contents of the data that might seem reasonable but isn't necessarily true.

But because in the single-null case the loop body is executed only once, the gains of arrangement that prefers nulls are pretty slim compared to long-string cases where the loop body is executed many times. For example if your dataset contains 99 cases of single null strings and one case of 100 chars long string, it might still be optimal on aggregate to use the long-string optimizing arrangement.

Of course there are still some cases where non-zero strings are extremely rare and as such optimizing for those makes sense.