Hacker News new | ask | show | jobs
by vvanders 3924 days ago
List is just one part that makes the C++ impl sub-optimal. All the scenes are allocated on the heap so you're actually bouncing around memory twice(once for the next link list entry, second for the actual heap allocation).

Lastly the virtual call again has the chance to cache miss(although probably only once seeing as we've just got a few impls).

In a small example like this there's a good chance that most of the heap allocations are close together but put this approach into production and you can very easily see fragmentation start driving up your cache misses.

(FWIW I'd also store radius*radius on creation, use a BSP/kd-tree and all sorts of other tricks that really help in these types of computations).

1 comments

I did a qucik and dirty test with levels = 14.

    # program          time   change        
    ./ray-ml:          21.46  # Original ocaml
    ./ray-cpp:         17.63  # Original c++
    ./ray-cpp-no-dtor: 13.86  # Removed all cleanup, ocaml doesn't do it
    ./ray-cpp-2:        8.90  # Changed list to vector
    ./ray-cpp-3:        7.89  # Removed virtual functions
    ./ray-cpp-4:        5.64  # Preallocated large array for all Groups
Solid stuff, glad to see someone testing my assumptions :).

If there's one thing I could teach new CS-grads it's how much your memory access patterns(and cache) really-effing-matter. Everything else is almost noise in comparison.

I'm pretty curious about this: have memory access patterns and cache awareness become more of a big deal in the past decade or so, or has it only recently reached my awareness? I studied CS in the early 2000s and while I was aware of various trade-offs between contiguous and linked data structures, it's only in the last few years that the importance of cache (and non-"system" languages' widespread flouting of its importance) has really bubbled up in my consciousness.

It's certainly likely that I just didn't take the right kinds of classes, or that I did but just missed or failed to really internalize this, but I'm curious if something else is going on. Has the bottleneck increasingly become memory rather than CPU? Have we more thoroughly exhausted other areas of optimization?

I was thinking about this while watching a talk[0] given by someone from Google about data structures in the STL, which decried that the standardized specification for `std::unordered_map` more or less forces an implementation to use buckets instead of the more cache-friendly technique of probing. I'm sure that the people who standardized it were very competent, which makes me think that, at least at the time, buckets were thought to be the better implementation.

[0]: https://www.youtube.com/watch?v=fHNmRkzxHWs

Edit: Added video link.

> Has the bottleneck increasingly become memory rather than CPU?

Yes.

Besides the basic memory latency vs. CPU clock speed issue, there's also:

- CPU caches have gotten bigger, so the payoff of making an program cache-friendly is larger

- CPUs have gained SIMD vector ALUs and increased superscalar behavior, so the CPU is capable of processing more data in per clock cycle

- A cultural shift happened after the freewheeling Moore's Law era ended in the mid-2000s, and people started examining performance more closely

http://gec.di.uminho.pt/discip/minf/ac0102/1000gap_proc-mem_...

Thanks, this is what I had suspected, but it seems to be left unstated in everything I've read.