| I wanted to get too fancy and I tried
* LoopVectorization.jl - @turbo choked on the loop
* a direct llvmcall to use AVX512 pop count - I malformed the types for the instruction
* Defining the `db` as db = [rand(Int8) for _ in 1:64, j in 1:(10^6)]; to avoid the vec of vecs structure, and then function my_cluster!(db, query, k)
db .= query .⊻ db
popcounts = mapreduce(count_ones, +, db, dims = 1)
results = reshape(popcounts, last(size(db)))
partialsortperm!(results, results, k)
@views results[begin:k]
end ...which I couldn't get to be faster than your version. If you use the `partialsortperm!` and reuse the same cache array, I suspect you'll get good speedups, as you won't be sorting the array every time. This is a classic `nth_element` algorithm. The above is not the most amazing code, but I suspect the lack of indexing will make it ridiculously friendly for a GPU (Edit: Nope, it chokes on `partialsortperm!`). I'm guessing the manual loopy approach should be just as good but I battled hard to get it somewhat competitive here in 6 lines of code #@be my_cluster!(X2, q1, 5)
Benchmark: 3 samples with 1 evaluation
42.883 ms (17 allocs: 15.259 MiB)
45.711 ms (17 allocs: 15.259 MiB)
46.670 ms (17 allocs: 15.259 MiB) #@be k_closest(X1, q1, 5)
Benchmark: 4 samples with 1 evaluation
27.994 ms (2 allocs: 176 bytes)
28.733 ms (2 allocs: 176 bytes)
29.000 ms (2 allocs: 176 bytes)
30.709 ms (2 allocs: 176 bytes) I also didn't try using `FixedSizedArrays.jl` as Mose Giordano recommended in my livestream chat. |