| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by poorlyknit 632 days ago

The fact that it happens

    1) at/around powers of two
    2) regardless whether using GC or preallocating

makes me think its got to do with cache sizes. For example, the M2 macbook has around 16MiB of cache, which gives you approximately 2 million ~ 2^20 nodes where each node has a 64 bit number and a 64 bit pointer.

The jumps in your measurements could be related to the cache hierarchy. On Linux (probably macOS too) you should be able to run "perf stat ./mergesort" to show you the hitrate of your CPU's caches.

I'd be interested in a follow-up post :)

1 comments

bediger4000 631 days ago

Thanks for reading my blog post and thinking about it. I thought of cache as well, the 3 machines I'm running benchmarking on are all very different. The M2 Macbook has 128-byte cache lines, as opposed to the other 2 machines, which have 64-byte ache lines. The old Dell R530 has level 1 through 3 caches, the others have only levels 1 and 2. I'm still trying to understand CPU caching and see if I can correlate it to the performance drops somehow.

The linked list nodes are 16 bytes, and both 8-byte fields (.Next and .Data) get read and written during sorting. Up to a point, larger node sizes,which I would thing would change caching, don't change where the performance drops occur: https://bruceediger.com/posts/mergesort-investigation-7/

Sorting an already sorted, or reverse sorted list, doesn't exhibit the performance drops either: https://bruceediger.com/posts/mergesort-investigation-9/

A purely recursive mergesort doesn't show the performance drops: https://bruceediger.com/posts/mergesort-investigation-3/

link

justinclift 631 days ago

Is the compiler, or a library you're using, somehow adding code that takes a different strategy at runtime when the linked list gets bigger than the performance-fall-off size?

link

bediger4000 630 days ago

Good question, but I don't think so. Two mergesort variants show it, one does not. A C transliteration of one of the variants that show it, also shows performance drops. The Go compiler for MacOS on M2 silicon probably generates very different code than the X86_64 on Linux, yet they both have the same performance drops.

The code only uses standard Go packages. I wrote the linked list struct, I'm not using a package's struct or code. I don't even use standard package code during sorting, only in timing and list preparation.

What you suggest may be true, but it seems very unlikely.

link

justinclift 630 days ago

No worries. It does sound like something else is going on then. :)

link