	by ben-schaaf 49 days ago
	For a pretty small N I've found that less clever can be quite a bit faster. I'd try a linear search - possibly SIMD if you can change the data format to struct-of-arrays. An adaptive approach that uses linear search up to a certain N can also yield some benefit.
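To make the adaptive idea concrete, here is a minimal sketch in C. It assumes a sorted array of 32-bit keys; the name `find_adaptive` and the cutoff of 16 are hypothetical, and the right cutoff would have to be measured on the target hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff; the best value depends on the hardware and the data. */
#define LINEAR_CUTOFF 16

/* Returns the index of key in the sorted array keys[0..n), or -1 if absent. */
static ptrdiff_t find_adaptive(const uint32_t *keys, size_t n, uint32_t key) {
    if (n <= LINEAR_CUTOFF) {
        /* Small N: a plain forward scan with a predictable loop. */
        for (size_t i = 0; i < n; i++) {
            if (keys[i] == key) return (ptrdiff_t)i;
        }
        return -1;
    }
    /* Larger N: lower-bound binary search over the window [lo, hi). */
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] < key) lo = mid + 1;
        else hi = mid;
    }
    return (lo < n && keys[lo] == key) ? (ptrdiff_t)lo : -1;
}
```

With a struct-of-arrays layout the keys would already be contiguous like this, which is also what a SIMD version of the scan would need.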

2 comments

charleslmunger 48 days ago

The first implementation I encountered was a linear search, starting at the last-found field. Empirically, a binary search with early exit and branchless bounds selection performed better, I think due to branch predictor pressure. The data representation could be changed, but it's tricky: other traversals want to go in sorted order, and lots of places pass just one pointer for fields. But I agree any further improvement will probably have to come from that.
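Roughly, the shape of that search looks like the sketch below. This is an illustration and not the actual code: it assumes a sorted array of 32-bit keys, and `find_branchless` is a hypothetical name. The only branch that depends on the data is the equality check; the bounds update is written as a select, which compilers typically lower to a conditional move, though that is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of key in the sorted array keys[0..n), or -1 if absent. */
static ptrdiff_t find_branchless(const uint32_t *keys, size_t n, uint32_t key) {
    const uint32_t *base = keys;  /* start of the remaining window */
    size_t len = n;               /* number of elements in the window */
    while (len > 0) {
        size_t half = len / 2;
        uint32_t probe = base[half];
        /* Early exit: stop as soon as the probe hits the key. */
        if (probe == key) return (ptrdiff_t)(base + half - keys);
        /* Bounds selection as arithmetic/select instead of if/else. */
        size_t go_right = probe < key;
        base += go_right * (half + 1);
        len = go_right ? len - half - 1 : half;
    }
    return -1;
}
```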

SIMD is tricky even with SoA because there is significant latency going between the general registers and the vector units, plus Arm little cores can be configured to share a vector unit with another core.


ben-schaaf 48 days ago

> SIMD is tricky even with SoA because there is significant latency going between the general registers and the vector units

My experience is mostly limited to AMD64, but libraries like glibc use SIMD in many places for faster linear search. Presumably they've done testing and found it worthwhile.


charleslmunger 48 days ago

Yeah, Arm little cores are a very different story: they aren't superscalar out-of-order architectures, and they can dispatch at most two operations per cycle.

Big cores are more like that, dispatching 8 or more operations per cycle, but they're also more expensive, larger, etc.


ChadNauseam 49 days ago

If you control the layout, the Eytzinger layout will typically give you the best of both worlds: as fast as a linear scan for small N, and much faster than binary search over a sorted array for large N.
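For anyone who hasn't seen it, the Eytzinger layout stores the implicit binary search tree in breadth-first order: the root sits in slot 1 and the children of slot k sit in slots 2k and 2k+1. A minimal sketch in C, assuming 32-bit keys and a 1-based array with slot 0 unused; `eytzinger_build` and `eytzinger_find` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Fills out[1..n] in Eytzinger (breadth-first) order from sorted[0..n).
   Call with pos = 0 and k = 1. Returns the next unread index in sorted. */
static size_t eytzinger_build(const uint32_t *sorted, uint32_t *out,
                              size_t n, size_t pos, size_t k) {
    if (k <= n) {
        pos = eytzinger_build(sorted, out, n, pos, 2 * k);      /* left subtree */
        out[k] = sorted[pos++];                                 /* this node */
        pos = eytzinger_build(sorted, out, n, pos, 2 * k + 1);  /* right subtree */
    }
    return pos;
}

/* Returns the 1-based slot of key in tree[1..n], or 0 if absent. */
static size_t eytzinger_find(const uint32_t *tree, size_t n, uint32_t key) {
    size_t k = 1;
    while (k <= n) {
        uint32_t v = tree[k];
        if (v == key) return k;
        k = 2 * k + (v < key);  /* descend left (2k) or right (2k+1) */
    }
    return 0;
}
```

The first few levels of the tree share a handful of cache lines, so the early probes stay close together in memory. The cost is that the array is no longer in sorted order, so in-order traversals need extra work.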
