by strictfp 4074 days ago
Nice article. It inspired me to look around for some more straightforward way of optimizing, and I found the setcc class of instructions: http://www.nynaeve.net/?p=178

I'm thinking that this combined with some CAS (CMPXCHG8B) could achieve the same, right?

Something like (pseudo):

Comparewith(4)

Ifequalstore(54)

Ifnotequalstore(2)

Return
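In C, the pseudocode above might be sketched roughly like this. The function names are made up for illustration, and whether a compiler actually emits setcc or cmov for either form depends on the compiler, flags and target, so treat that part as an assumption rather than a guarantee.

```c
/* Hypothetical helper: returns 54 if x == 4, otherwise 2.
   Compilers often (not always) lower a ternary like this to cmov. */
static int select_ternary(int x) {
    return (x == 4) ? 54 : 2;
}

/* Same result built from the 0/1 value of the comparison, which is
   the kind of value setcc materializes in a register. */
static int select_setcc_style(int x) {
    int eq = (x == 4);   /* 1 if equal, 0 otherwise */
    return 2 + eq * 52;  /* 2 when eq == 0, 54 when eq == 1 */
}
```

Neither version needs a CAS: there is no shared memory location being updated, just a value being selected.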

2 comments

I think CAS is a pretty slow operation even without a LOCK prefix. You probably don't want to use it for anything other than inter-core synchronization.

If you have a lot of data to process, using SSE/AVX is a huge win, for example through conditional masking and the min/max instructions.
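A minimal sketch of conditional masking, assuming an x86 target with SSE2 (there is a scalar fallback with the same masking logic for other targets). The helper name `max4_i32` is hypothetical. It computes a per-lane maximum with a compare mask instead of a branch; SSE4.1 has a dedicated `_mm_max_epi32` for this, which is deliberately not used here so the sketch only needs SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: out[i] = max(a[i], b[i]) for four int32 lanes. */
static void max4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4]) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Each lane of mask is all-ones where a > b, all-zeros otherwise. */
    __m128i mask = _mm_cmpgt_epi32(va, vb);
    /* Select: (mask & a) | (~mask & b). No branch on the data. */
    __m128i r = _mm_or_si128(_mm_and_si128(mask, va),
                             _mm_andnot_si128(mask, vb));
    _mm_storeu_si128((__m128i *)out, r);
#else
    /* Scalar fallback using the same mask-and-select idea per element. */
    for (int i = 0; i < 4; ++i) {
        int32_t mask = -(int32_t)(a[i] > b[i]);
        out[i] = (a[i] & mask) | (b[i] & ~mask);
    }
#endif
}
```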

SIMD is a huge win especially in sorting: you can get a 10-40x speed-up by using a bitonic sorting network.
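For reference, here is a scalar sketch of a bitonic sorting network, assuming the input length is a power of two. The names `cmp_swap` and `bitonic_sort` are made up for illustration. It does not use SIMD itself and says nothing about the speed-up figure above; it only shows why the structure suits SIMD: the sequence of compare-exchanges is fixed by the indices and never depends on the data, so each compare-exchange can be done with min/max or masking.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless compare-exchange: afterwards *lo <= *hi. */
static void cmp_swap(int32_t *lo, int32_t *hi) {
    int32_t a = *lo, b = *hi;
    int32_t mask = -(int32_t)(a > b);  /* all-ones if out of order */
    int32_t diff = (a ^ b) & mask;     /* a ^ b if swapping, else 0 */
    *lo = a ^ diff;
    *hi = b ^ diff;
}

/* Bitonic sort in ascending order. n must be a power of two.
   The ifs below depend only on indices, not on the values sorted. */
static void bitonic_sort(int32_t *v, size_t n) {
    for (size_t k = 2; k <= n; k <<= 1) {          /* size of merged runs */
        for (size_t j = k >> 1; j > 0; j >>= 1) {  /* partner distance */
            for (size_t i = 0; i < n; ++i) {
                size_t l = i ^ j;                  /* partner index */
                if (l > i) {
                    if ((i & k) == 0)
                        cmp_swap(&v[i], &v[l]);    /* ascending run */
                    else
                        cmp_swap(&v[l], &v[i]);    /* descending run */
                }
            }
        }
    }
}
```

A SIMD version would typically replace `cmp_swap` with vector min/max on whole registers plus shuffles to line up partners.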

Aren't setcc/cmov* instructions effectively similar to a branch? To compute the result you need to execute the previous instruction.

I suppose that these instructions do not cause the instruction pipeline to be flushed, unlike an incorrectly predicted jump, but they still stall until the previous instruction has been executed.

jmp < setcc/cmov* < branchless conditionals

Conditional moves have data dependencies on their input arguments, but so do the "branchless" versions presented in the article.