Hacker News new | ask | show | jobs
by vvanders 3926 days ago
Kinda missing the point of C++ by using a linked-list.

If you're going to spend the time to implement an algorithm in C++ at least take the time to make it cache aware. Otherwise it's not really surprising when the performance is the same when you're doing the exact same thing.

Would be curious to see what the performance looks like with a tightly packed scene and static dispatch on a discrete number of primitives instead of using virtual functions.

1 comments

I was curious about whether using a vector over list would have any effect even though the size is so small and the answer seems no (I guess as it could be expected).

It seems like the listed OCaml program has a default level of 9 where as the C++ one uses 6?

Times using original listing (minus print stmts):

  $ time ./ray-cpp  
  real    0m3.369s  
  user    0m3.347s  
  sys     0m0.010s

  $ time ./ray-ml   
  real    0m4.710s  
  user    0m4.703s  
  sys     0m0.003s
Time after changing the 9 in the ocaml program to a six:

  $ time ./ray-ml  
  real    0m4.032s  
  user    0m4.026s  
  sys     0m0.003s
Seems like C++ has a good edge. Above was ran on a 64 bit machine.
List is just one part that makes the C++ impl sub-optimal. All the scenes are allocated on the heap so you're actually bouncing around memory twice(once for the next link list entry, second for the actual heap allocation).

Lastly the virtual call again has the chance to cache miss(although probably only once seeing as we've just got a few impls).

In a small example like this there's a good chance that most of the heap allocations are close together but put this approach into production and you can very easily see fragmentation start driving up your cache misses.

(FWIW I'd also store radius*radius on creation, use a BSP/kd-tree and all sorts of other tricks that really help in these types of computations).

I did a qucik and dirty test with levels = 14.

    # program          time   change        
    ./ray-ml:          21.46  # Original ocaml
    ./ray-cpp:         17.63  # Original c++
    ./ray-cpp-no-dtor: 13.86  # Removed all cleanup, ocaml doesn't do it
    ./ray-cpp-2:        8.90  # Changed list to vector
    ./ray-cpp-3:        7.89  # Removed virtual functions
    ./ray-cpp-4:        5.64  # Preallocated large array for all Groups
Solid stuff, glad to see someone testing my assumptions :).

If there's one thing I could teach new CS-grads it's how much your memory access patterns(and cache) really-effing-matter. Everything else is almost noise in comparison.

I'm pretty curious about this: have memory access patterns and cache awareness become more of a big deal in the past decade or so, or has it only recently reached my awareness? I studied CS in the early 2000s and while I was aware of various trade-offs between contiguous and linked data structures, it's only in the last few years that the importance of cache (and non-"system" languages' widespread flouting of its importance) has really bubbled up in my consciousness.

It's certainly likely that I just didn't take the right kinds of classes, or that I did but just missed or failed to really internalize this, but I'm curious if something else is going on. Has the bottleneck increasingly become memory rather than CPU? Have we more thoroughly exhausted other areas of optimization?

I was thinking about this while watching a talk[0] given by someone from Google about data structures in the STL, which decried that the standardized specification for `std::unordered_map` more or less forces an implementation to use buckets instead of the more cache-friendly technique of probing. I'm sure that the people who standardized it were very competent, which makes me think that, at least at the time, buckets were thought to be the better implementation.

[0]: https://www.youtube.com/watch?v=fHNmRkzxHWs

Edit: Added video link.

> Has the bottleneck increasingly become memory rather than CPU?

Yes.

Besides the basic memory latency vs. CPU clock speed issue, there's also:

- CPU caches have gotten bigger, so the payoff of making an program cache-friendly is larger

- CPUs have gained SIMD vector ALUs and increased superscalar behavior, so the CPU is capable of processing more data in per clock cycle

- A cultural shift happened after the freewheeling Moore's Law era ended in the mid-2000s, and people started examining performance more closely

http://gec.di.uminho.pt/discip/minf/ac0102/1000gap_proc-mem_...