Hacker News new | ask | show | jobs
by miguelraz 754 days ago
I wanted to get too fancy and I tried * LoopVectorization.jl - @turbo choked on the loop * a direct llvmcall to use AVX512 pop count - I malformed the types for the instruction * Defining the `db` as

db = [rand(Int8) for _ in 1:64, j in 1:(10^6)];

to avoid the vec of vecs structure,

and then

function my_cluster!(db, query, k) db .= query .⊻ db popcounts = mapreduce(count_ones, +, db, dims = 1) results = reshape(popcounts, last(size(db))) partialsortperm!(results, results, k) @views results[begin:k] end

...which I couldn't get to be faster than your version. If you use the `partialsortperm!` and reuse the same cache array, I suspect you'll get good speedups, as you won't be sorting the array every time. This is a classic `nth_element` algorithm.

The above is not the most amazing code, but I suspect the lack of indexing will make it ridiculously friendly for a GPU (Edit: Nope, it chokes on `partialsortperm!`).

I'm guessing the manual loopy approach should be just as good but I battled hard to get it somewhat competitive here in 6 lines of code

#@be my_cluster!(X2, q1, 5) Benchmark: 3 samples with 1 evaluation 42.883 ms (17 allocs: 15.259 MiB) 45.711 ms (17 allocs: 15.259 MiB) 46.670 ms (17 allocs: 15.259 MiB)

#@be k_closest(X1, q1, 5) Benchmark: 4 samples with 1 evaluation 27.994 ms (2 allocs: 176 bytes) 28.733 ms (2 allocs: 176 bytes) 29.000 ms (2 allocs: 176 bytes) 30.709 ms (2 allocs: 176 bytes)

I also didn't try using `FixedSizedArrays.jl` as Mose Giordano recommended in my livestream chat.