Hacker News new | ask | show | jobs
by flgr 3279 days ago
I think in order to really reap the benefits of this you'll want to actually reorganize the in-memory layout of the trees in order to make sure all elements needed for the comparison end up in the same cache line.

I haven't been following this closely, but the last time I checked scatter-gather loads were really really slow.

Chapter 3.3 from page 27 (PDF page 43) on of this be interesting: https://www.researchgate.net/profile/Florian_Gross/publicati...

Also contains a survey of some other related data structures and algorithms.

2 comments

As the saying goes, "memory is the new disk". Data structures like B-trees which were designed to improve the performance of data accesses from disk are now useful for improving the performance of algorithms in memory.
Does anybody have any data on whether external sorting [1] is the way to go these days for CPU based sorts?

That is, use quicksort for chunks up to the size of the L1/L2/L3 cache (depending on benchmarking) and then mergesort the chunks.

[1] https://en.wikipedia.org/wiki/External_sorting

You can have a look at the best sorting algorithms (for various definitions of "best") here http://sortbenchmark.org/
That's for big sorts. The question here is about sorting with a single CPU.
Look at the Joule category.
I don't have any data readily available, but it's generally understood that you want a cache-aware sort, yes. Usually I see people opt for samplesort rather than sorting chunks and merging, though.
Thanks for your opinion.

I did not know about samplesort - found it on Wikipedia.

Thank you, scatter-gather was the google term I was looking for, though I couldn't immediately find any architectures that implement a vector scatter-gather. Are you aware of any?
Not sure what could be a scalar scatter/gather, but NV and AMD GPUs do the vector one (you can read write n values from n offsets with a single instruction).
The Cray absolutely did. How else do you do sparse matrix multiply?