by dalke
2073 days ago
Sorry, I see how what I wrote could be interpreted that way, and part of it came from ignorance. Although I didn't say so, I assumed the macro was doing something equivalent to the C++ template metaprogramming I've heard is used for similar tasks. What I didn't see was how I could use the AVX2 instructions myself. Checking now: Julia's count_ones() maps to the LLVM popcount intrinsic, and recent clang versions know how to optimize that fixed-length sequence in C, even for AVX-512. So the Julia equivalent of the code I wrote should perform well. A few optimizations might still be missing, such as keeping one AVX register loaded with a constant byte string and using prefetch instructions. I'll be talking with the conference participant who brought up Julia to work this out in more detail. Thanks for the comment!
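For illustration, a minimal sketch of the kind of fixed-length popcount loop mentioned above, written in C. The 1024-bit length (16 64-bit words) and the name popcount_fixed are hypothetical, and this is not the author's actual code. Whether clang turns it into AVX2 or AVX-512 instructions depends on the compiler version and target flags (for example -O3 -march=native), so treat the vectorization as something to check in the generated assembly, not a guarantee.

```c
#include <stdint.h>

/* Hypothetical fixed bit-string size: 1024 bits = 16 x 64-bit words. */
#define NUM_WORDS 16

/* Count the set bits in a fixed-length bit string.
   Because NUM_WORDS is a compile-time constant, the compiler can fully
   unroll this loop; with a suitable -march it may also vectorize it. */
int popcount_fixed(const uint64_t words[NUM_WORDS]) {
    int total = 0;
    for (int i = 0; i < NUM_WORDS; i++) {
        /* GCC/clang builtin; compiles to a POPCNT instruction when available */
        total += __builtin_popcountll(words[i]);
    }
    return total;
}
```

Julia's count_ones() lowers to the same LLVM popcount intrinsic, which is why a loop like this one is a reasonable point of comparison between the two languages.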
> What I didn't see was how I could use the AVX2 instructions myself.
If you ever want manual control over vectorization, the package SIMD.jl [1] is pretty good for handwritten vectorization. There's also VectorizationBase.jl [2], which LoopVectorization.jl uses. Which of the two is most appropriate mostly depends on which interface you prefer.
[1] https://github.com/eschnett/SIMD.jl
[2] https://github.com/chriselrod/VectorizationBase.jl