Hacker News
by kenhwang 2618 days ago
Ordered by realtime, fastest to slowest for those like me who got annoyed by the scrolling up and down trying to compare:

  Rust (1.13.0-nightly)         1m32.392s
  Nim (0.14.2)                  1m53.320s
  C                             1m59.116s
  Julia (0.4.6)                 2m01.166s
  Crystal (0.18.7)              2m01.735s
  C Double Precision            2m26.546s
  Java (1.7.0_111)              2m36.949s
  Nim Double Precision (0.14.2) 3m19.547s
  OCaml                         3m59.597s
  Go 1.6                        6m44.151s
  node.js (6.2.1)               7m59.041s
  node.js (5.7.1)               8m49.170s
  C#                           12m18.463s
  PyPy                         14m02.406s
  Lisp                         24m43.216s
  Haskell                      26m34.955s
  Elixir                      123m59.025s
  Elixir MP                   138m48.241s
  Luajit                      225m58.621s
  Python                      348m35.965s
  Lua                         611m38.925s

I rewrote the Go benchmark to be a mechanical translation of C and it performs much better.

    C        (gcc -O3):       23.8s
    Julia    (julia 1.1.0):   32.8s
    Go (alt) (go 1.12):       39.3s
    Java     (java 1.8.0_60): 44.2s
    Go (org) (go 1.12):       64.8s
    OCaml    (ocaml 4.07.1):  79.1s
    JS       (node 11.14.0):  137.0s
    Pypy     (pypy 6.0.0):    139.4s
    C#       (mono 4.2.1):    187.3s

    Rust: DOES NOT COMPILE
So Go is only twice as slow as C, not thrice as slow. This puts it just ahead of Java and just behind Julia.
Note that the Java impl is creating objects all over the place in the inner loop - madness!

I'm sure a "mechanical translation of [the] C" version would improve things for the Java ver as well. If we removed startup costs (the class file validation, etc) I'd expect it to be on par with C.

The startup costs are fixed (they don't increase linearly with the number of loop iterations) and for such a small program, they could not reasonably explain more than 1s of the 20s gap between C and Java. Also, I don't think it's "allocating objects in an inner loop", because Java's allocations are super cheap (bump allocations) if the escape analyzer doesn't keep them on the stack in the first place.
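For anyone unfamiliar with the term, a bump allocation is roughly "advance an offset and return the old position". The sketch below is my own illustration in C, not how any particular JVM implements it; `struct arena`, `arena_alloc`, `arena_reset` and the 4096-byte size are made-up names and numbers.

```c
#include <stddef.h>

/* Hypothetical fixed-size arena; names and sizes are illustrative only. */
#define ARENA_SIZE 4096

struct arena {
    /* The union forces the byte buffer to start at a well-aligned address. */
    union { long double align; unsigned char bytes[ARENA_SIZE]; } mem;
    size_t used;  /* bytes handed out so far */
};

/* Allocate by bumping an offset: one bounds check and one addition. */
static void *arena_alloc(struct arena *a, size_t size)
{
    /* Round up to a multiple of 16 so every block starts on a 16-byte
       offset from the buffer start. */
    size_t rounded = (size + 15u) & ~(size_t)15u;
    if (rounded < size || rounded > ARENA_SIZE - a->used)
        return NULL;                 /* overflow or arena exhausted */
    void *p = a->mem.bytes + a->used;
    a->used += rounded;              /* the "bump" */
    return p;
}

/* Freeing everything at once is just resetting the offset. */
static void arena_reset(struct arena *a)
{
    a->used = 0;
}
```

There is no per-object free list to search, which is why this style of allocation is cheap compared to a general-purpose `malloc`.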

That said, after examining this benchmark further, I don't think it's very good since the sequences returned by the random number generators are not controlled for (each implementation uses its own standard library RNG with their own seeds, so the sequences will vary from language to language). This likely causes more loop iterations, but considering the loop termination condition, the theoretical distribution of RNG outputs, and the trivial work done in the loop body, I doubt that the delta in loop iterations can explain any significant portion of the gap. Rather, I think the gap is simply a difference in performance of the RNGs themselves--C and Rust use a poor man's RNG (xorshift) which performs very well for this exercise but is not a good general purpose RNG (and standard library RNGs are optimized for the general case). When I rewrote the Go version, using the xorshift implementation made the most significant impact (15s), although I'm not 100% sure that the output of the RNG isn't just causing it to run the RNG less frequently. I opened up this ticket against the project: https://github.com/niofis/raybench/issues/15.
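To make the RNG point concrete, here is a minimal sketch of a xorshift generator in C. It follows Marsaglia's xorshift128 with his published default seeds; I have not checked that it matches the exact code or seeds in crb.c, and the names (`struct xor128`, `xor128_next`, `xor128_float`) are mine.

```c
#include <stdint.h>

/* Per-generator state, so each thread can own one and no lock is needed. */
struct xor128 { uint32_t x, y, z, w; };

/* Marsaglia's published default seeds (assumed, not taken from raybench). */
static struct xor128 xor128_default(void)
{
    struct xor128 s = { 123456789u, 362436069u, 521288629u, 88675123u };
    return s;
}

/* One step: three shifts, four xors, and a rotation of the state words. */
static uint32_t xor128_next(struct xor128 *s)
{
    uint32_t t = s->x ^ (s->x << 11);
    s->x = s->y;
    s->y = s->z;
    s->z = s->w;
    s->w = s->w ^ (s->w >> 19) ^ t ^ (t >> 8);
    return s->w;
}

/* Uniform float in [0, 1): keep the top 24 bits, which a float holds exactly. */
static float xor128_float(struct xor128 *s)
{
    return (float)(xor128_next(s) >> 8) * (1.0f / 16777216.0f);
}
```

The whole generator is a handful of register operations with no locking and no global state, which is a plausible reason it beats a general-purpose standard library RNG in a loop this tight.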

Julia was astonishing. It's a high level language that's performing almost like C.

Last time I checked, many years back, the spec was changing and the run time did crash. Guess it has gone a long way since.

The other one is Lua. My assumption was that it's one of the lightest and fastest languages around. Looks like "fastest" isn't true in some cases.

Different languages' benchmarks might not be equally well-written / optimized. In particular, I'd expect C and Rust to be very close to each other, and a 20% gap between them is a red flag.

Rules like "code should be simple, as in, easy to read and understand" are also hard to judge, especially near the top of the list where there's a lot of pressure to optimize. Is SIMD easy to understand? What if it's in a library? What if the library was written specifically for this benchmark? Etc. I think https://benchmarksgame-team.pages.debian.net/benchmarksgame/ has to deal with every possible permutation of this debate.

Not necessarily. Because C allows arbitrary pointer manipulation, the compiler generally cannot make assumptions about pointer aliasing. This prevents some optimizations.

In Rust, the compiler has more information/control over memory layout/lifetime and can therefore make stronger optimizations.

Automatic vectorization is an area where this helps a lot, and raytracing can benefit a lot here. 20% sounds reasonable to me.

Well, generally, perhaps. But any performance oriented C programmer worth his or her salt would be aware of aliasing issues and write code in such a way that it doesn't cause problems for the compiler. Plus, this is a toy benchmark of a few hundred lines so the compiler can do full-program analysis. So the 20% difference is indeed a smell.

Looking at the crb*.c files, structs are passed as pointers and not by value. This makes it harder for the compiler to analyze the data flow which I would bet is part of the reason Rust is faster here.

> pointer aliasing

Unfortunately, due to LLVM bugs, the Rust developers had to disable that optimization, more than once. I don't know whether the "1.13.0-nightly" he used has that optimization enabled or disabled. (See https://github.com/rust-lang/rust/issues/31681 and https://github.com/rust-lang/rust/issues/54878 for the relevant Rust issues.)

That's a good point, I didn't think about the effect on autovectorization. Do you think that's what's happening here? My impression was that getting good vector code out of the compiler usually requires manually tuning things.
shouldn't rust being faster than C be something of a red flag that they aren't quite the same algorithm? Or that the algorithm is sub-optimal?
C and Rust have been trading blows on the language benchmark games for a while now which dictates the algorithm used. From my experience, it's relatively easy to accidentally write fast Rust, but incredibly hard to write fast C.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

It got me thinking as well, so I ventured and did some experiments on this, and found that the main difference is the algorithm used for the RNG; C's std lib uses a slower one (which also is thread safe, and butchered OpenMP performance). You can take a look at a more apples to apples comparison in the latest update for crb.c which uses a xor128 rng; rust is still a little faster (especially when going multithreaded), but not quite the difference in the README file, still need to get some time to update it.
fwiw, I looked at some of the quicker c/rust examples, without too much other analysis

  crb-vec-omp //I added some #pragma omp to crb-vec
  executable size:
    18k
  time:
    real 0m3.630s
  valgrind: 
    ==17703== HEAP SUMMARY:
    ==17703==     in use at exit: 7,408 bytes in 15 blocks
    ==17703==   total heap usage: 20 allocs, 5 frees, 14,790,856 bytes allocated

  rsrb_alt_mt.rs
  executable size:
      426k
  time: 
    real 0m1.630s
  valgrind: 
    ==7221== HEAP SUMMARY:
    ==7221==     in use at exit: 43,120 bytes in 216 blocks
    ==7221==   total heap usage: 256 allocs, 40 frees, 11,113,784 bytes allocated
and because we have a number of tiny single cpu vm's out there (which would also benefit from a performant language) I gave it a shot there:

  :~# time ./rsrb_alt_mt
  ./rsrb_alt_mt: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by ./rsrb_alt_mt)

  real 0m0.002s
  user 0m0.002s
  sys 0m0.000s

  :~# time ./crb-vec-omp 

  real 0m24.234s
  user 0m24.160s
  sys 0m0.035s
so rust appears broken on centos 7.5 (no, I'm not going to edit the binary). But that is an insta-deal breaker for us.
Have you tried passing everything by value? That is, instead of:

    bool hit_sphere(const struct sphere* sp, const struct ray* ray, struct hit* hit)
you write:

    static bool
    hit_sphere(struct sphere sp, struct ray ray, struct hit hit)
IME, clang is insanely good at optimizing pass by value calls.
fwiw, I took a stab at replacing a bunch of -> with . and ran it:

  time ./crb-vec-omp
  real 0m0.764s

which is more than twice as fast as the rust example,

but it didn't create the right output...
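One possible explanation for the wrong output, assuming `hit` is an out-parameter that the caller reads afterwards (I have not verified this against the repo): once it is passed by value, the callee only writes to its own copy. The struct layout and function names below are hypothetical, just to show the effect.

```c
#include <stdbool.h>

/* Hypothetical result record; the real raybench struct may differ. */
struct hit { bool found; float dist; };

/* Out-parameter through a pointer: the caller sees the writes. */
static void record_hit_ptr(struct hit *h, float dist)
{
    h->found = true;
    h->dist = dist;
}

/* Same body after replacing -> with . : only the local copy is modified,
   so the caller's struct is left untouched. */
static void record_hit_copy(struct hit h, float dist)
{
    h.found = true;
    h.dist = dist;
    (void)h;  /* the copy is discarded when the function returns */
}

/* By-value style that keeps the result: return the struct instead. */
static struct hit record_hit_ret(float dist)
{
    struct hit h = { true, dist };
    return h;
}
```

So a by-value rewrite would need inputs passed by value and the result returned by value, rather than a blanket substitution.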

if you get bored would you mind taking a stab at adding parallel and modifying crb-vec.c https://github.com/niofis/raybench , I definitely think you might be on to something here.

> shouldn't rust being faster than C be something of a red flag that they aren't quite the same algorithm? Or that the algorithm is sub-optimal?

The difference isn't much. And Rust is more like FORTRAN. Maybe a bit faster than C, but can't do the gymnastics with pointers that C can.

> can't do the gymnastics with pointers that C can

It can if you write the "unsafe" keyword, but there's a pretty strong community norm around not doing that sort of thing, unless you can encapsulate it inside some sort of safe API. And to be fair to C, I think C can close the gap with Rust/Fortran if you use the "restrict" keyword a lot?
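For reference, this is what the `restrict` version looks like in C99 and later. It is a minimal sketch with a made-up function (`add_scaled`), not code from the benchmark, and whether the compiler actually vectorizes either loop depends on the compiler and flags.

```c
#include <stddef.h>

/* Without restrict, dst may overlap a or b, so the compiler has to assume
   that each store to dst[i] can change later loads from a[] and b[]. */
static void add_scaled(float *dst, const float *a, const float *b,
                       float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + k * b[i];
}

/* With restrict, the caller promises the three arrays do not overlap,
   which gives the optimizer the same no-alias information that safe Rust
   references or Fortran arguments carry by default. */
static void add_scaled_restrict(float *restrict dst,
                                const float *restrict a,
                                const float *restrict b,
                                float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + k * b[i];
}
```

The catch is that the promise is unchecked: calling the `restrict` version with overlapping arrays is undefined behavior, whereas the Rust compiler enforces the equivalent rule.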

With unsafe, you can do anything that C can.

Without unsafe, there’s significantly more aliasing information, which helps optimizations.

Rust is compiled by LLVM, while C is compiled by GCC, which is a bit conservative. It's possible to enable the same optimizations for GCC and LLVM, so their speed will match.
Wasn’t Julia specifically designed to be easy to optimise? It’s not quite like other higher level languages as they thought about performance first.
That's using version 0.4 of Julia too (current is 1.1). Current version has a lot of improved optimization passes that would likely benefit this benchmark.
That's a big jump between OCaml and Go. I'm not familiar with ray tracing, but skimming the source code it mostly looks like it's doing floating point math; it doesn't look like it's using the runtime (no allocations, no virtual function calls, no scheduling, etc), so I'm surprised that Go is performing relatively poorly.

I wonder if the performance gap is attributable to some overhead in Go's function calls? I know Go passes parameters on the stack instead of via registers... Maybe it's due to passing struct copies instead of references (looks like the C version passes references)? Generally poor code generation?

Anyone else have ideas or care to profile?

EDIT: On my 2015 MBP, Go (version 1.12) is indeed quite a lot slower than C, but only when the C is built with optimizations (`-O3`):

    tmp $  time ./gorb
    real    1m15.128s
    user    1m9.366s
    sys     0m6.754s

    tmp $  clang crb.c
    tmp $  time ./a.out
    real    1m13.041s
    user    1m10.284s
    sys     0m0.624s

    tmp $  gcc crb.c -o crb -std=c11 -O3 -lm -D_XOPEN_SOURCE=600
    tmp $  time ./crb
    real    0m22.703s
    user    0m22.550s
    sys     0m0.073s

    tmp $  clang crb.c -o crb -std=c11 -O3 -lm -D_XOPEN_SOURCE=600
    tmp $  time ./crb
    real    0m22.689s
    user    0m22.564s
    sys     0m0.060s
EDIT2: I re-modified the Go version (https://gist.github.com/weberc2/2aed4f8d3189d09067d564448367...) to pass references and that seems to put it on par with C (or I mistranslated, which is also likely):

    $ time ./gorb 
    real    0m19.282s
    user    0m14.467s
    sys     0m7.523s
There's a variety of possibilities. Lerc mentions GC as one possibility, which could definitely be the case. Another one that would be high on my "first guess" list is that everything above it has much better optimizers, and raytracing code is one of the places this is really going to show. Go does basically very little optimization, because it prioritizes fast compilation.

(Where Go "wants" to play is that same benchmark, except including compilation time.)

A couple of the things below Go I suspect are bad implementations. I would expect a warmed-up C# to beat Go if both have reasonable (not super-crazy optimized implementations) or at least be at parity, and Luajit may also be a slow implementation. In both cases because ray-traced code is a great place for a JIT to come out and play. EDIT: Oh, I see C# is Mono, and not the Windows implementation. In that case that makes sense.

Oh, and I find it helpful to look at these things logarithmically. I think it matches real-world experiences somewhat better, even though we truly pay for performance in the linear world. From that perspective, it's still only the second largest. The largest is Haskell to Elixir, which is substantially larger. O'Caml->Go is large, but not crazily so; several other differences come close.

There are multiple “levels” of performance in play here, and which level a language performs on depends on the language, runtime and implementation.

The most naive level is e.g. allocating heap objects for vectors, rays etc. On that level the algorithm is probably bounded by pointer chasing, cache misses and GC.

The next level up is an allocation-free loop (at least)

The best level is an optimized, allocation-free loop. If the implementation isn’t allowed to optimize (use SoA instead of AoS, manually vectorize, unroll etc) then the winning languages will be the ones that have sophisticated compilers such as those with LLVM backends.

As an example: The C# example should be on the second level here - but it has a poor implementation (looks like it’s ported from java or written by a java developer) so it’s actually stuck on the first naive level.
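For readers who haven't met the SoA/AoS distinction, here is a small C sketch of the two layouts doing the same batch of dot products. The struct names and the batch size of 4 are made up for illustration; none of this is taken from the benchmark sources.

```c
#include <stddef.h>

#define COUNT 4  /* illustrative batch size */

/* Array of structures (AoS): x, y, z of one vector sit next to each other. */
struct v3 { float x, y, z; };

/* Structure of arrays (SoA): all x values are contiguous, then y, then z. */
struct v3_batch { float x[COUNT], y[COUNT], z[COUNT]; };

/* AoS: each iteration reads three neighbouring floats, then jumps to the
   next struct, so one SIMD lane per vector needs shuffling or gathers. */
static void dot_aos(const struct v3 *a, const struct v3 *b, float *out)
{
    for (size_t i = 0; i < COUNT; i++)
        out[i] = a[i].x * b[i].x + a[i].y * b[i].y + a[i].z * b[i].z;
}

/* SoA: iteration i touches element i of each array, so consecutive
   iterations read consecutive memory, which maps directly onto SIMD loads. */
static void dot_soa(const struct v3_batch *a, const struct v3_batch *b,
                    float *out)
{
    for (size_t i = 0; i < COUNT; i++)
        out[i] = a->x[i] * b->x[i] + a->y[i] * b->y[i] + a->z[i] * b->z[i];
}
```

Both functions compute the same results; the difference is only in how easy the memory layout makes it for a compiler (or a person) to process several vectors per instruction.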

Like I responded to Lerc, I don't see any allocs in the hot path here.

Also, as I edited, I updated the Go version to pass by reference and that put it on par with C (and also per my update, I may have mistranslated somehow).

Profiling results:

- initial time: 49s

- replacing the default thread-safe RNG (which uses mutexes) with rand.New sliced 6 seconds off = 43s

- use float64 instead of float32 and remove many type conversions. Another two seconds off. = 41s.

As others suggested, go still lacks many compile time optimizations and the implementation could be improved.

I did some profiling too--I shaved off 10-15s by using the C xorshift implementation instead of rand.Float32() (which spends a lot of time locking a mutex).
I'm not familiar with go but I seem to recall it is garbage collected.

If so it may be something to do with the creation of new vectors on the heap instead of the stack. The compiler would have to determine the full lifetime of the vector value to be able to bump it to the stack. That's an optimization, and sometimes it's just not possible (but probably is here).

In the C instance no such optimization is necessary. They ask for it on the stack, and if you try to use it after the stack is gone, Bad Things™ happen.

A hypothesis to test would be that languages above the jump are managing to work on the stack and ones below are allocating objects on the heap.

Is it creating vectors in the hot path? I'm not seeing it. Go does some escape analysis (the optimization you're referring to), but it's pretty conservative.
All of the vector operations return new vectors instead of mutating.

    func v3_add (a, b v3) v3 {
     return v3{x: a.x + b.x, y: a.y + b.y, z: a.z + b.z}
    }
A sensible approach for maintainable code but without knowing if a and b are used elsewhere the compiler can't reuse the space they occupy. If C can't figure it out, it can just stick a new one on the stack which doesn't cost too much.

This is, of course, Rust's bread and butter which is probably why it takes the top spot.

Those are copied on the stack, not heap allocated, so the GC wouldn't come into play.
Probably because the Go version is different: compare the C vector operations with the Go vector operations. The C version operates on pointers without allocating a new vector, while the Go version allocates a new vector on every op.

EDIT: Look at the C code's assembly, it's generating mostly SIMD instructions and using xmm registers. That's why it's faster. The Go compiler still does not have autovectorization implemented, which is why it's so much slower in this case.

EDIT2: It seems Go version also uses SSE here, which is nice. So probably unnecessary allocation from my original post was the reason.

I modified the Go version to pass references (see my second edit) and that made up the difference (or I mistranslated).
Ah, you replied to my comment before the edit, about unnecessary allocation in Go vec handling.
I'm 99% sure the Go version doesn't allocate any vectors; afaict, it's passing everything on the stack.
Python gets 20% faster if you use `__slots__` on the Vector class which is created and destroyed millions of times. It's still the second-slowest, but it's a nice improvement :P
I wouldn’t give much weight to those benchmark numbers at this point. Some of those language versions are quite out of date...
The numbers are indeed out of date. I re-ran the tests for some of the C and Nim programs using Nim 0.19.4 and gcc 7.3.0 on Windows 10. Here are the results:

  crb-omp      0m22.949s
  crb          0m23.240s
  crb_opt      0m31.404s

  nimrb_pmap   0m8.828s
  nimrb_fn     0m22.988s
  nimrb        0m26.556s
The base C code is faster than base Nim code. The optimised C code is significantly slower than everything else(!?). The Nim program that uses a threadpool is the fastest of these.

I couldn't get the CPP versions to compile. I'll do the Rust programs some other time.

If you are using a current version of your C compiler and can consistently reproduce those numbers just by adding something like -O3, you should probably file a bug report. Optimizers can go wrong and sometimes pessimize things a bit, but an almost 50% slowdown from enabling optimizations would be treated as an important bug to fix. (Though people are saying that significant amounts of time in the benchmark are spent in random number generation and output, so maybe try to find out first if the problem lies in one of those.)
Ah, nevermind, (a) even the "non-optimized" C version uses -O3, and (b) the "C" and the "optimized C" programs differ not only in compiler flags but they are actually different source codes. Specifically, the "optimized C" version doesn't use the faster random number generator.

If you fix that, on my machine it's 17.3 seconds for the base C version and 13.4 for the optimized one, i.e., a 22% improvement from turning on the extra optimizations (-march=native and -ffast-math).
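As a side note on why `-ffast-math` is a separate opt-in: among other things it lets the compiler reassociate floating-point arithmetic, which can change results. A tiny C sketch of the underlying effect (the function names are mine, and the values are chosen only to make the rounding visible):

```c
/* Floating-point addition is not associative, so without -ffast-math the
   compiler must keep the evaluation order written in the source. */
static float sum_left(float a, float b, float c)
{
    return (a + b) + c;
}

static float sum_right(float a, float b, float c)
{
    return a + (b + c);
}
```

With a = 1e20f, b = -1e20f and c = 1.0f, `sum_left` gives 1 because the two large values cancel first, while `sum_right` gives 0 because the 1 is rounded away when added to -1e20f. Reordering freedom like this is what allows more aggressive vectorization of sums, at the cost of bit-exact reproducibility.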

And for whatever it's worth, because some people love hating on GCC in favor of Clang, my Clang timings are 19.5 and 16.5 seconds, respectively.

It's surprising that the optimized C is slower; it's much faster on my machine. Like 20s (optimized C) vs 1m20s (unoptimized C).
I would imagine current versions would only give the newer, actively developed languages like Rust, Nim, Crystal, and Go a boost. I don't see Java and C improving much.
The node version is very old. V8 has improved significantly since v6.