Hacker News new | ask | show | jobs
by djacobs7 1983 days ago
Is the article saying that the M1 is slower than we would have expected in this case?

My understanding, based on the article, is that a normal processor, we would have expected arr[idx] + arr[idx+1] and arr[idx] to take the same amount of time.

But the M1 is so parallelized that it goes to grab both arr[idx] and arr[idx+1] separately. So we have to wait for both of those two return. Meanwhile, on a less parallelized processor, we would have done arr[idx] first and waited for it to return, and the processor would realize that it already had arr[idx+1] without having to do the second fetch.

Am I understanding this right?

2 comments

>> My understanding, based on the article, is that a normal processor, we would have expected arr[idx] + arr[idx+1] and arr[idx] to take the same amount of time.

That depends. If the two accesses are on the same cache line, then yes. But since idx is random that will not happen sometimes. He never says how big array[] is in elements or what size each element is.

I thought DRAM also had the ability to stream out consecutive addresses. If so then it looks like Apple could be missing out here.

Then again, if his array fits in cache he's just measuring instruction counts. His random indexes need to cover that whole range too. There's not enough info to figure out what's going on.

> There's not enough info to figure out what's going on.

If you only look at the article this is true. However, the source code is freely available: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

I ran the benchmark on my system

It's a 6 years old system, fastest times are in the 25ns range

- 2-wise+ is 5% slower than 2-wise

- 3-wise is 46% slower than 2-wise

- 3-wise is 39% slower than 2-wise+

on the M1

- 2-wise+ is 40% slower than 2-wise

- 3-wise is 46% slower than 2-wise

- 3-wise is 4% slower than 2-wise+

Ouch. This is on my $2500 i5 mbpro from 2018.

$ ./two_or_three

N = 1000000000, 953.7 MB

starting experiments.

two : 53.3 ns

two+ : 60.1 ns

three: 78.6 ns

bogus 1375316400

------

2+ 12% slower than 2

3-wise 47% slower than 2

3-wise 30% slower than 2+

-------

Ratios aside, that's an interesting speed leap when the article gets 9 ms for 2-wise. Mind, the laptop had lots of applications running, i didn't clear it up to do a proper benchmark, but still.

Interesting, I ran it on my laptop (i7-7700HQ) with the following results:

- 2-wise+ is 19% slower than 2-wise

- 3-wise is 48% slower than 2-wise

- 3-wise is 25% slower than 2-wise+

However, as mentioned in the post the numbers can vary a lot, and I noticed a maximum run-to-run difference of 23ms on two-wise.

I tried it on my old (2009) 2.5GHz Phenom II X4 905e (GCC 10.2.1 -O3, 64 bit) and got results almost perfectly matching the conventional wisdom:

  two  : 97.4 ns
  two+  : 97.9 ns
  three: 145.8 ns
He's only got 3 million random[] numbers. Weather that's enough depends on the cache size. It also bothers me to read code like this where functions take parameters (like N) and never use them.
TLDR: he is using a random index with a big enough array
He mentioned it's a 1GB array, and the source code is available.
That array is indexed by an array of random numbers and there are only 3M of them. That should be enough assuming even 4 bytes per index it will just fit in the 12MB cache, but then there are accesses to the big array as well.
Its a little confusing because they're conflating the idea that you almost certainly read at least the entire word (and not a single byte) at a time with the other idea that you could fetch multiple words concurrently.
Any cached memory access is going to read in the entire cache line -- 64 bytes on x86, apparently 128 on M1. This is true across most architectures which use caches; it isn't specific to M1 or ARM.
(As I learned from recent Rust concurrency changes) on newer Intel, it usually fetches two cache lines so effectively 128 bytes while AMD usually 64 bytes. That's the sizes they use for "cache line padded" values (I.e making sure to separate two atomics by the fetch size to avoid threads invalidating the cache back and forth too much).
To be clear here, it fetches two cache lines but it doesn’t put the second in exclusive state until it’s written to; the unit of granularity is still 64b. In a scanning read mode you will see the benefit but you won’t see the contention on writes. (The contention will come from subsequent reads on that cache line though)
Yes almost certainly more than the word will be read but it varies by architecture. I would think almost by definition no less than a word can be read so I went with that in my explanation.