I ended up writing one in the meantime -- check out the end of the readme. For this workload the C version seems to be about 30x faster. The C program gets away with no heap allocation and with the entire render loop inlined, even though Lean does do quite a lot of its float arithmetic unboxed. I tried to optimize the Lean version and only shaved 25% off the runtime (the code is in a branch).
I suspect that the Lean version could be made a bit faster if each thread had its own copy of the scene, since Lean handles reference counting differently for objects shared between threads (with atomic updates, as far as I know).