| Hardware branch prediction is very powerful (https://www.intel.com/content/www/us/en/developer/articles/t...) but it's also not perfect and costly misdirections may happen (http://ithare.com/infographics-operation-costs-in-cpu-clock-...). Branchless computing (https://www.youtube.com/watch?v=g-WPhYREFjk) is a technique which may help a lot with mitigating the overhead of branching. Alternative way, coming from linux kernel (https://docs.kernel.org/staging/static-keys.html) is to leverage knowledge about branches direction and literally change the instructions via run-time code patching to avoid the potential branching overhead all together. When to go for such extreme solution? When performance really matters: - `and` branches are known at compile-time
- `or` branches are not changing often at run-time
- `and/or` branches are expensive to compute/require memory access
- `and/or` branches are hard to learn by the hardware branch predictor due to their random nature
Some examples include logging, tracing, configuration, hot-path, algo, etc.> https://github.com/boost-ext/sb - x86-64/linux/gcc/clang moves the solution to user space (for run-time) as well as to compile-time with C++20. > static_branch - Minimal overhead run-time branch (https://godbolt.org/z/797WeYYGK) / (nop=not taken / jmp=taken) void fun() { // can be inlined/constexpr or not
if (sb::static_branch<"semi run-time branch">::get()) {
std::puts("taken");
} else {
std::puts("not taken");
}
}
int main() {
fun(); // not taken
sb::static_branch<"semi run-time branch">::set(true);
fun(); // taken
sb::static_branch<"semi run-time branch">::set(false);
fun(); // not taken
}
main: // $CXX -O3
lea rdi, [rip + .L.str.1]
nop # code patching (nop->nop)
lea rdi, [rip + .L.str.2]
.Ltmp1:
call puts@PLT # not taken
call static_branch<"semi run-time branch">::set(true) # relatively slow
lea rdi, [rip + .L.str.1]
jmp .Ltmp2 # code patching (nop->jmp)
lea rdi, [rip + .L.str.2]
.Ltmp2:
call puts@PLT # taken
call static_branch<"semi run-time branch">::set(false) # relatively slow
lea rdi, [rip + .L.str.1]
nop # code patching (nop->nop)
lea rdi, [rip + .L.str.2]
.Ltmp3:
call puts@PLT # not taken
.L.str.1: .asciz "taken"
.L.str.2: .asciz "not taken"
> const_branch - Zero overhead compile-time branch (https://godbolt.org/z/xEboP8WYe) template<auto tag = []{}> // Note: `tag` is required to delay `fun`
// instantiation as get has to be dependent
auto fun() {
if constexpr (sb::const_branch<"compile-time branch">::get<tag>()) {
std::puts("taken");
} else {
std::puts("not taken");
}
}
int main() {
fun(); // not taken
static_assert(sb::const_branch<"compile-time branch">::set<true>());
fun(); // taken
static_assert(sb::const_branch<"compile-time branch">::set<false>());
fun(); // not taken
}
main: // $CXX -O3
lea rbx, [rip + .L.str.1]
mov rdi, rbx
call puts@PLT # not taken
lea rdi, [rip + .L.str.2]
call puts@PLT # taken
mov rdi, rbx
call puts@PLT # not taken
xor eax, eax # return 0
ret
.L.str.1: .asciz "not taken"
.L.str.2: .asciz "taken"
Acknowledgments- https://docs.kernel.org/staging/static-keys.html
- https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
- https://www.intel.com/content/www/us/en/developer/articles/t...
- https://www.agner.org/optimize/instruction_tables.pdf
- https://www.felixcloutier.com/x86
- https://uops.info/table.html
- https://arxiv.org/abs/2308.14185 |