| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by singron 1871 days ago
	On a GPU you could do a parallel n-ary search. I.e. instead of making 1 test in the center, make n tests. E.g. 1 round of 32-ary search should be equivalent of 5 rounds of binary search. Not only do you avoid the branch predictor, but the memory access is also parallel, which isn't true in branchless binary search. If the width of the search is larger than a warp/wavefront (32/64), then you need to add synchronization and it might not be faster anymore. GPUs aren't really built for latency though and this will waste a lot of accesses on speculation. You are probably better off with a good btree.