| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by rjtobin 2172 days ago

The cartoon picture is that the first example will read everything into cache once, whereas the second example will read everything into cache twice.

Cache lines are typically 64-bytes, so to write a single character to main memory the following things happen (again, a cartoon picture): First read the 64-bytes area that contains the byte of interest so that it is owned by my cache (this is called a RFO, "read-for-ownership"). Second, update the byte of interest. Thirdly (at some point) write the cache-line back to main memory.

In the sequential case, we just read one 64-byte cache line at a time, update those 64 chars, then write the cache line back to main memory.

In the second example, we first update all the even-indexed characters, which still forces us to read in every cache line. Then we loop around and do the odd-indexed characters, at which point we have to read the cache lines all over again (assuming the array is big enough that the whole thing can't fit in cache at once).

1 comments

bananaface 2172 days ago

Am I misreading the second example's algorithm? Isn't it allocating like this:

1, 2, 4 3, 8 7 6 5, 16 15 14 13 12 11 10 9, ...

And so on?

---

Also this part:

>Cache lines are typically 64-bytes

Right, but I thought when you access an index it caches quite a lot more than 64 bytes from the index. Doesn't it throw a larger chunk of the array onto multiple lines? If that's the case then the first example is making very efficient use of the cache. If the modern CPUs are smart enough to cache backwards and I understand the second example, isn't the second too?

link

bananaface 2172 days ago

Ok turns out I was way off, it's actually completely broken. Just ran the code, printed j at each index.

>>> main(20)

2 4 8 16 12 4 8 16 12 4 8 16 12 4 8 16 12 4 8 16

It's not even hitting odd indexes. Over half the array will be garbage at the end. I guess that would count as out-of-order though.

link

rjtobin 2172 days ago

Yep, I misread it also: I saw j = 2 * i (which would do evens and then odds when NUMBER is odd, or evens then evens again if NUMBER is even).

For what it really is - powers of 2 mod NUMBER - when NUMBER is large most reads should be out of the cache. So the first example has to read from main memory only every 64th index, and the second example has to read from main memory on almost every read. I think this agrees with what you are saying. This also explains why it is ~5x slower, which seemed too large from my previous understanding.

link

bananaface 2171 days ago

>So the first example has to read from main memory only every 64th index

Wouldn't it be 8 per cache line (an int is 64 bits, each cache line is 64 bytes)? I'm also assuming it caches a larger chunk of the array across multiple lines. Is that not how it works?

But I think there's a more fundamental issue here, which is that the amount measured, 68 million bytes in a second, is what - 60Mb? Did he just reduce the array size until it completed in a second? Because a very significant chunk of that is going to fit in L3 cache (on an i7 it's 8Mb), so even if you had a good random access algorithm, it would understate the problem because the data is still contiguous.

Which seems kinda dumb to me, since the real-word problem you're likely to run into is when your data is stored non-contiguously because it's scattered across multiple different structs/objects, making it impossible to utilise the cache to a significant degree at all. In that (very common under OO or interpreted languages) situation I'd expect a way more dramatic slowdown.

link