Hacker News new | ask | show | jobs
by phkahler 1983 days ago
>> My understanding, based on the article, is that a normal processor, we would have expected arr[idx] + arr[idx+1] and arr[idx] to take the same amount of time.

That depends. If the two accesses are on the same cache line, then yes. But since idx is random that will not happen sometimes. He never says how big array[] is in elements or what size each element is.

I thought DRAM also had the ability to stream out consecutive addresses. If so then it looks like Apple could be missing out here.

Then again, if his array fits in cache he's just measuring instruction counts. His random indexes need to cover that whole range too. There's not enough info to figure out what's going on.

2 comments

> There's not enough info to figure out what's going on.

If you only look at the article this is true. However, the source code is freely available: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

I ran the benchmark on my system

It's a 6 years old system, fastest times are in the 25ns range

- 2-wise+ is 5% slower than 2-wise

- 3-wise is 46% slower than 2-wise

- 3-wise is 39% slower than 2-wise+

on the M1

- 2-wise+ is 40% slower than 2-wise

- 3-wise is 46% slower than 2-wise

- 3-wise is 4% slower than 2-wise+

Ouch. This is on my $2500 i5 mbpro from 2018.

$ ./two_or_three

N = 1000000000, 953.7 MB

starting experiments.

two : 53.3 ns

two+ : 60.1 ns

three: 78.6 ns

bogus 1375316400

------

2+ 12% slower than 2

3-wise 47% slower than 2

3-wise 30% slower than 2+

-------

Ratios aside, that's an interesting speed leap when the article gets 9 ms for 2-wise. Mind, the laptop had lots of applications running, i didn't clear it up to do a proper benchmark, but still.

Interesting, I ran it on my laptop (i7-7700HQ) with the following results:

- 2-wise+ is 19% slower than 2-wise

- 3-wise is 48% slower than 2-wise

- 3-wise is 25% slower than 2-wise+

However, as mentioned in the post the numbers can vary a lot, and I noticed a maximum run-to-run difference of 23ms on two-wise.

I tried it on my old (2009) 2.5GHz Phenom II X4 905e (GCC 10.2.1 -O3, 64 bit) and got results almost perfectly matching the conventional wisdom:

  two  : 97.4 ns
  two+  : 97.9 ns
  three: 145.8 ns
He's only got 3 million random[] numbers. Weather that's enough depends on the cache size. It also bothers me to read code like this where functions take parameters (like N) and never use them.
TLDR: he is using a random index with a big enough array
He mentioned it's a 1GB array, and the source code is available.
That array is indexed by an array of random numbers and there are only 3M of them. That should be enough assuming even 4 bytes per index it will just fit in the 12MB cache, but then there are accesses to the big array as well.