Hacker News new | ask | show | jobs
by kris-jusiak 700 days ago
Hardware branch prediction is very powerful (https://www.intel.com/content/www/us/en/developer/articles/t...) but it's also not perfect and costly misdirections may happen (http://ithare.com/infographics-operation-costs-in-cpu-clock-...).

Branchless computing (https://www.youtube.com/watch?v=g-WPhYREFjk) is a technique which may help a lot with mitigating the overhead of branching.

Alternative way, coming from linux kernel (https://docs.kernel.org/staging/static-keys.html) is to leverage knowledge about branches direction and literally change the instructions via run-time code patching to avoid the potential branching overhead all together.

When to go for such extreme solution? When performance really matters:

  - `and` branches are known at compile-time
  - `or` branches are not changing often at run-time
      - `and/or` branches are expensive to compute/require memory access
      - `and/or` branches are hard to learn by the hardware branch predictor due to their random nature
Some examples include logging, tracing, configuration, hot-path, algo, etc.

> https://github.com/boost-ext/sb - x86-64/linux/gcc/clang moves the solution to user space (for run-time) as well as to compile-time with C++20.

> static_branch - Minimal overhead run-time branch (https://godbolt.org/z/797WeYYGK) / (nop=not taken / jmp=taken)

    void fun() { // can be inlined/constexpr or not
      if (sb::static_branch<"semi run-time branch">::get()) {
        std::puts("taken");
      } else {
        std::puts("not taken");
      }
    }

    int main() {
      fun(); // not taken
    
      sb::static_branch<"semi run-time branch">::set(true);
      fun(); // taken
    
      sb::static_branch<"semi run-time branch">::set(false);
      fun(); // not taken
    }

    main: // $CXX -O3
      lea rdi, [rip + .L.str.1]
      nop # code patching (nop->nop)
      lea rdi, [rip + .L.str.2]
     .Ltmp1:
      call puts@PLT # not taken
    
      call static_branch<"semi run-time branch">::set(true) # relatively slow
    
      lea rdi, [rip + .L.str.1]
      jmp .Ltmp2 # code patching (nop->jmp)
      lea rdi, [rip + .L.str.2]
     .Ltmp2:
      call puts@PLT # taken
    
      call static_branch<"semi run-time branch">::set(false) # relatively slow
    
      lea rdi, [rip + .L.str.1]
      nop # code patching (nop->nop)
      lea rdi, [rip + .L.str.2]
     .Ltmp3:
      call puts@PLT # not taken
    
    .L.str.1: .asciz "taken"
    .L.str.2: .asciz "not taken"
> const_branch - Zero overhead compile-time branch (https://godbolt.org/z/xEboP8WYe)

    template<auto tag = []{}> // Note: `tag` is required to delay `fun`
                              // instantiation as get has to be dependent
    auto fun() {
      if constexpr (sb::const_branch<"compile-time branch">::get<tag>()) {
        std::puts("taken");
      } else {
        std::puts("not taken");
      }
    }

    int main() {
      fun(); // not taken
    
      static_assert(sb::const_branch<"compile-time branch">::set<true>());
      fun(); // taken
    
      static_assert(sb::const_branch<"compile-time branch">::set<false>());
      fun(); // not taken
    }

    main: // $CXX -O3
      lea  rbx, [rip + .L.str.1]
      mov  rdi, rbx
      call puts@PLT # not taken
    
      lea  rdi, [rip + .L.str.2]
      call puts@PLT # taken
    
      mov  rdi, rbx
      call puts@PLT # not taken
    
      xor  eax, eax # return 0
      ret
    
    .L.str.1: .asciz "not taken"
    .L.str.2: .asciz "taken"
Acknowledgments

- https://docs.kernel.org/staging/static-keys.html - https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html - https://www.intel.com/content/www/us/en/developer/articles/t... - https://www.agner.org/optimize/instruction_tables.pdf - https://www.felixcloutier.com/x86 - https://uops.info/table.html - https://arxiv.org/abs/2308.14185