by weberc2 2618 days ago
That's a big jump between OCaml and Go. I'm not familiar with ray tracing, but skimming the source code it mostly looks like it's doing floating point math; it doesn't look like it's using the runtime (no allocations, no virtual function calls, no scheduling, etc), so I'm surprised that Go is performing relatively poorly.

I wonder if the performance gap is attributable to some overhead in Go's function calls? I know Go passes parameters on the stack instead of via registers... Maybe it's due to passing struct copies instead of references (looks like the C version passes references)? Generally poor code generation?

Anyone else have ideas or care to profile?

EDIT: On my 2015 MBP, Go (version 1.12) is indeed quite a lot slower than C, but only when the C is built with optimizations (`-O3`):

    tmp $  time ./gorb
    real    1m15.128s
    user    1m9.366s
    sys     0m6.754s

    tmp $  clang crb.c
    tmp $  time ./a.out
    real    1m13.041s
    user    1m10.284s
    sys     0m0.624s

    tmp $  gcc crb.c -o crb -std=c11 -O3 -lm -D_XOPEN_SOURCE=600
    tmp $  time ./crb
    real    0m22.703s
    user    0m22.550s
    sys     0m0.073s

    tmp $  clang crb.c -o crb -std=c11 -O3 -lm -D_XOPEN_SOURCE=600
    tmp $  time ./crb
    real    0m22.689s
    user    0m22.564s
    sys     0m0.060s

EDIT2: I modified the Go version (https://gist.github.com/weberc2/2aed4f8d3189d09067d564448367...) to pass references, and that seems to put it on par with C (or I mistranslated, which is also likely):

    $ time ./gorb 
    real    0m19.282s
    user    0m14.467s
    sys     0m7.523s
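
To make the change concrete, here is a minimal sketch of the two calling styles. The `v3` type and both function names are stand-ins I made up for illustration, not the code from the gist; the pointer version just mimics the C habit of writing into a destination.

```go
package main

import "fmt"

// v3 is a hypothetical stand-in for the benchmark's vector type.
type v3 struct{ x, y, z float64 }

// addByValue copies both operands in and copies the result out.
func addByValue(a, b v3) v3 {
	return v3{a.x + b.x, a.y + b.y, a.z + b.z}
}

// addByRef reads the operands through pointers and writes the sum
// into dst, so no struct is copied across the call boundary.
func addByRef(dst, a, b *v3) {
	dst.x = a.x + b.x
	dst.y = a.y + b.y
	dst.z = a.z + b.z
}

func main() {
	a, b := v3{1, 2, 3}, v3{4, 5, 6}

	var out v3
	addByRef(&out, &a, &b)

	// Both styles compute the same vector; only the calling convention differs.
	fmt.Println(addByValue(a, b), out)
}
```

Whether the pointer style is actually faster depends on inlining and on the compiler version, so it is worth benchmarking both rather than assuming.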

There's a variety of possibilities. Lerc mentions GC as one possibility, which could definitely be the case. Another one that would be high on my "first guess" list is that everything above it has a much better optimizer, and ray tracing code is one of the places where this is really going to show. Go does very little optimization, because it prioritizes fast compilation.

(Where Go "wants" to play is that same benchmark, except including compilation time.)

A couple of the things below Go I suspect are bad implementations. I would expect a warmed-up C# to beat Go, or at least be at parity, if both have reasonable (not super-crazy optimized) implementations, and the LuaJIT entry may also be a slow implementation. In both cases that's because ray tracing code is a great place for a JIT to come out and play. EDIT: Oh, I see C# is Mono, and not the Windows implementation. In that case that makes sense.

Oh, and I find it helpful to look at these things logarithmically. I think it matches real-world experiences somewhat better, even though we truly pay for performance in the linear world. From that perspective, the OCaml->Go gap is still only the second largest. The largest is Haskell to Elixir, which is substantially larger. OCaml->Go is large, but not crazily so; several other differences come close.

There are multiple "levels" of performance in play here, and which level a language performs on depends on the language, runtime and implementation.

The most naive level is, e.g., allocating heap objects for vectors, rays, etc. On that level the algorithm is probably bound by pointer chasing, cache misses and GC.

The next level up is an allocation-free loop (at least).

The best level is optimized and allocation-free. If the implementation isn't allowed to optimize by hand (use SoA instead of AoS, manually vectorize, unroll, etc.), then the winning languages will be the ones that have sophisticated compilers, such as those with LLVM backends.

As an example: the C# version should be on the second level here, but it has a poor implementation (looks like it's ported from Java or written by a Java developer), so it's actually stuck on the first, naive level.
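
For anyone unfamiliar with the SoA vs AoS distinction mentioned above, here is a rough sketch in Go. The `sphere` and `spheres` types and the helper names are made up for illustration and are not taken from any of the benchmark implementations.

```go
package main

import "fmt"

// AoS (array of structs): each sphere's fields sit next to each other.
type sphere struct{ x, y, z, r float64 }

// SoA (struct of arrays): each field lives in its own contiguous slice,
// which is the layout that tends to suit vectorized loops.
type spheres struct{ x, y, z, r []float64 }

// toSoA converts the AoS layout into the SoA layout.
func toSoA(in []sphere) spheres {
	out := spheres{
		x: make([]float64, len(in)),
		y: make([]float64, len(in)),
		z: make([]float64, len(in)),
		r: make([]float64, len(in)),
	}
	for i, s := range in {
		out.x[i], out.y[i], out.z[i], out.r[i] = s.x, s.y, s.z, s.r
	}
	return out
}

// sumRadiiAoS strides over whole structs to reach each radius.
func sumRadiiAoS(in []sphere) float64 {
	total := 0.0
	for _, s := range in {
		total += s.r
	}
	return total
}

// sumRadiiSoA walks one densely packed slice of radii.
func sumRadiiSoA(in spheres) float64 {
	total := 0.0
	for _, r := range in.r {
		total += r
	}
	return total
}

func main() {
	scene := []sphere{{0, 0, 0, 0.5}, {1, 0, 0, 1.5}, {0, 1, 0, 2}}
	fmt.Println(sumRadiiAoS(scene), sumRadiiSoA(toSoA(scene)))
}
```

Both functions return the same sum; the difference is only in memory layout, and any speedup would have to be measured on the real workload.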

Like I responded to Lerc, I don't see any allocs in the hot path here.

Also, as I edited, I updated the Go version to pass by reference and that put it on par with C (and also per my update, I may have mistranslated somehow).

Profiling results:

- initial time: 49s

- replacing the default thread-safe RNG with rand.New shaved 6 seconds off (the default RNG uses mutexes) = 43s

- using float64 instead of float32 and removing many type conversions took another two seconds off = 41s
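
A minimal sketch of the RNG change from the first step, assuming a single-goroutine render loop. The helper names are mine, not from the benchmark source. The package-level `rand.Float32` is safe for concurrent use, which (at least in the Go 1.12 era discussed here) means it takes a lock on every call; a `*rand.Rand` from `rand.New` is not safe for concurrent use and skips that synchronization.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleGlobal draws from the shared package-level generator, which
// synchronizes internally so that any goroutine may call it.
func sampleGlobal() float32 {
	return rand.Float32()
}

// newLocalRNG returns a private generator. It must not be shared
// between goroutines without external locking.
func newLocalRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

func main() {
	rng := newLocalRNG(42) // the seed value is arbitrary

	// Both calls yield a float32 in [0, 1).
	fmt.Println(sampleGlobal(), rng.Float32())
}
```

If the renderer were parallelized, one option would be one such generator per goroutine.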

As others suggested, Go still lacks many compile-time optimizations, and the implementation could be improved.

I did some profiling too--I shaved off 10-15s by using the C xorshift implementation instead of rand.Float32() (which spends a lot of time locking a mutex).
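
For reference, a lock-free xorshift generator is only a few lines of Go. This is a sketch of the classic 32-bit variant (shifts 13, 17, 5); I'm assuming that variant for illustration, and it is not necessarily the exact one used in the C version. The type and method names are made up.

```go
package main

import "fmt"

// xorshift32 is a tiny PRNG with no locking; not safe for concurrent use.
type xorshift32 struct{ state uint32 }

// newXorshift32 seeds the generator. A zero state would stay zero
// forever, so it is replaced with 1.
func newXorshift32(seed uint32) *xorshift32 {
	if seed == 0 {
		seed = 1
	}
	return &xorshift32{state: seed}
}

// Next advances the state and returns the next 32-bit value.
func (r *xorshift32) Next() uint32 {
	x := r.state
	x ^= x << 13
	x ^= x >> 17
	x ^= x << 5
	r.state = x
	return x
}

// Float32 maps the top 24 bits to [0, 1); 24 bits fit exactly in a
// float32 mantissa, so the result never rounds up to 1.
func (r *xorshift32) Float32() float32 {
	return float32(r.Next()>>8) / (1 << 24)
}

func main() {
	rng := newXorshift32(1)
	for i := 0; i < 3; i++ {
		fmt.Println(rng.Float32())
	}
}
```

The statistical quality is much weaker than math/rand's, which is usually acceptable for jittering rays in a benchmark but not for much else.
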
I'm not familiar with Go, but I seem to recall it is garbage collected.

If so it may be something to do with the creation of new vectors on the heap instead of the stack. The compiler would have to determine the full lifetime of the vector value to be able to bump it to the stack. That's an optimization, and sometimes it's just not possible (but probably is here).

In the C instance no such optimization is necessary. They ask for it on the stack, and if you try to use it after the stack is gone, Bad Things™ happen.

A hypothesis to test would be that languages above the jump are managing to work on the stack and ones below are allocating objects on the heap.

Is it creating vectors in the hot path? I'm not seeing it. Go does some escape analysis (the optimization you're referring to), but it's pretty conservative.
All of the vector operations return new vectors instead of mutating.

    func v3_add (a, b v3) v3 {
     return v3{x: a.x + b.x, y: a.y + b.y, z: a.z + b.z}
    }
A sensible approach for maintainable code, but without knowing whether a and b are used elsewhere, the compiler can't reuse the space they occupy. If the C compiler can't figure it out, it can just stick a new one on the stack, which doesn't cost too much.

This is, of course, Rust's bread and butter, which is probably why it takes the top spot.

Those are copied on the stack, not heap allocated, so the GC wouldn't come into play.
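
One way to check this rather than argue about it is to count allocations directly. Below is a sketch using `testing.AllocsPerRun`; the `v3` type and function names are stand-ins I made up, not the benchmark's code. Building with `go build -gcflags=-m` also prints the compiler's escape analysis decisions.

```go
package main

import (
	"fmt"
	"testing"
)

// v3 is a hypothetical stand-in for the benchmark's vector type.
type v3 struct{ x, y, z float64 }

// Package-level sinks keep the compiler from discarding the results.
var (
	sinkVal v3
	sinkPtr *v3
)

// addValue returns a new vector by value; the result is copied,
// so no heap allocation should be needed.
func addValue(a, b v3) v3 {
	return v3{a.x + b.x, a.y + b.y, a.z + b.z}
}

// addPtr returns a pointer to a new vector. The pointer outlives the
// call, so the vector has to escape to the heap.
//
//go:noinline
func addPtr(a, b v3) *v3 {
	return &v3{a.x + b.x, a.y + b.y, a.z + b.z}
}

// allocsByValue reports the average heap allocations per addValue call.
func allocsByValue() float64 {
	a, b := v3{1, 2, 3}, v3{4, 5, 6}
	return testing.AllocsPerRun(1000, func() { sinkVal = addValue(a, b) })
}

// allocsByPointer reports the average heap allocations per addPtr call.
func allocsByPointer() float64 {
	a, b := v3{1, 2, 3}, v3{4, 5, 6}
	return testing.AllocsPerRun(1000, func() { sinkPtr = addPtr(a, b) })
}

func main() {
	fmt.Println("by value:  ", allocsByValue())
	fmt.Println("by pointer:", allocsByPointer())
}
```

With the standard toolchain I'd expect zero allocations for the by-value version and one per call for the pointer-returning version, but the numbers can vary with compiler version and inlining, so run it against the real code.
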
Probably because the Go version is different: compare the C vector operations with the Go vector operations. The C version operates on pointers without allocating a new vector, while the Go version allocates a new vector on every op.

EDIT: Look at the assembly for the C code: it's mostly SIMD instructions using xmm registers. That's why it's faster. The Go compiler still doesn't implement autovectorization, which is why it's so much slower in this case.

EDIT2: It seems the Go version also uses SSE here, which is nice. So the unnecessary allocation from my original post was probably the reason.

I modified the Go version to pass references (see my second edit) and that made up the difference (or I mistranslated).
Ah, you replied to my comment before the edit, about unnecessary allocation in Go vec handling.
I'm 99% sure the Go version doesn't allocate any vectors; afaict, it's passing everything on the stack.