Hacker News new | ask | show | jobs
by andrepd 2329 days ago
"without a single branch" I thought it meant actually a branchless version of wc. Turns out it's just no explicit if statements.
2 comments

If you were careful, it seems pretty plausible to be able to use some indexing tricks and CMOV to write a jump/branchless version of wc. You are basically counting newlines and runs of whitespace.
Yeah, it's Nnt just plausible, it's fairly straightforward. Here are some untested versions that have branch-free inner loops for lines (easiest, this will probably compile branch-free even if you aren't careful), and words:

https://godbolt.org/z/JNU-_G

CMOV aside, I remember it was proved MOV itself is turing-complete.
slightly related, does embedding arithmetic in mov instructions go faster than explicit ALU operations ?
Depending on the arithmetic, it seems that yes it can! I've noticed gcc using LEA instructions for arithmetic of the form (x * a + b), where 'a' and 'b' fit with the instruction.
Using LEA (load effective address) for calculation seems to be pretty common in most programs for both gcc and llvm. You basically get smaller code, and it used to schedule them better across more execution ports. Not sure if the CPU can fuse the ADD+MUL now.
"Counting newlines" implies branching.
You could subtract the value of '\n' from each byte, bitwise-or the result with its negated value, right shift the sign bit into the least significant position, and push that into an accumulator.

Basically, it is possible to reduce a byte into a one or a zero if it does or does not match a value using only bitwise instructions. I don't know if the way I just described uses the fewest instructions, but you can definitely do it without branching.

In fact, you can do this to test any integer comparison. Start by subtracting the arguments to the comparison operator, and then use one of these "squash" functions for the comparison operator in question (assuming 32 bit signed integers are being compared):

== : ~(c | -c) >>> 31

!= : (c | -c) >>> 31

> : -c >>> 31

>= : ~c >>> 31

< : c >>> 31

<= : ~-c >>> 31

Not really:

   cnt += *ptr++ == '\n'
Should compile down to no branches (not even cmov) by summing the value of the flag register directly. Would be a it hard to stop without taking a branch though. Would you consider function pointer calls a branch? If that's too easy, what about taking a segfault?
Even if you do the branchiest thing possible in the source, no bitwise or "looks cmov-y" stuff, all compilers that matter will do this without a branch with optimization on:

https://godbolt.org/z/XCBGYW

Only icc vectorizes this at -O2 (still branchless, of course), but clang and gcc vectorize it if you go to -O3.

Sure, but where's the code golfing fun in there ;)
:) I guess I'm not fun at [code golfing] parties.

I just want to gently push back on the notion that the source level branchy-ness is tightly tied to the generated assembly level branchiness.

Sometimes, it is - but it is a long topic to characterize when. For simple things like counters based on a condition, you'll almost always get branch-free. Same for assignment/return value based on a condition, where the possibilities don't involve lots of asymmetric work (or a asymmetric memory access).

Almost everyone in tech, it seems, interprets “without X” as meaning “without explicit X :(