Hacker News new | ask | show | jobs
by birdbrain 2479 days ago
I don't want to detract from the cleverness here, but I believe your benchmarks could use some work. Here are a few suggestions:

1. Simply testing something 1000 times and (presumably) presenting the arithmetic mean is not very informative. Looking at the detailed reported benchmark times (in the output file in tests), it looks like many of the timing outcomes have high variance. Rather than running the tests 1000 times and taking the mean, you might consider running 10 batches of 100 tests (or 1000, if you can) and presenting the mean and variance of the resulting distribution. In general, k sample groups each of size p will provide more reliable information about the underlying distribution than one sample group of size k*p (for reasonable k and p, obviously).

2. Related to that, the results of the "inserting a number of elements" and "deleting a number of elements" tests are significantly worse for the tiered vector vs the std::vector than the "insert/delete a single element" tests. You don't mention this in the readme, but thinking about why it is might be informative. Thrashing seems like a possible explanation, and one you might be able to mitigate.

3. Are you making sure your cache is warm before starting to measure performance? (Pardon, I didn't look through every line of your tests.) Particularly for std::vector, and likely your intermediate deques too, this will have a big effect on timing.

4. Finally, it looks like you're primarily testing using ints (?). It would probably be a good idea to see if your results hold for a different payload size.

I don't know whether these will improve or worsen your comparison against std::vector, but they will make your claims more robust.

2 comments

He's going to have cache issues since he requires one extra memory lookup.

His lookup should be roughly twice as expensive as a regular array when cache is cold - it will be dominated by the cache misses two vs one.

With a hot cache, the array lookup becomes a single load op of a few cycles, and even if he gets two cache hits, his will probably be about 6-10 ops with 2 loads, a div, and a few more ops to get the index of the sub bucket.

On one cache hit and a possible miss on the other (eg his top level index is in cache, but the bucket might not be), he's going to be getting high variance in his lookup timings. OP: to help the iterator access (sum), you might be able to prefetch (eg force a read) of the folling bucket before you need it. when you move over the bucket 2, cause a fetch of bucket 3 and the same time. You might be able to hide some of that cache miss latency then.

Also, with the DEQs you are potentially starting at the middle of a slab and having to roll around its front. That isn't going to prefetch well, so you might want to fetch the next next slab's base address and the address it logically starts at since those cane be different. Just try to hide as much cache miss latency as you can because that is where you are probably getting killed on in the iterator access.

Thanks for feedback!