I saw the OpenMP pragma and thought to myself, "neat! should be fun to watch the cores work hard at this," so I went ahead and compiled and ran it, and smiled at the 400% CPU usage in top.
$ time ./tinykaboom
./tinykaboom 78.08s user 0.02s system 369% cpu 21.159 total
Then I wondered how it would fare if I ported it to Go, so I hastily did, thinking, "hmmm, this should run a bit slower than the C++ version." Surprisingly, it ran more than twice as fast:
$ go build ./tinykaboom.go
$ time ./tinykaboom
./tinykaboom 34.32s user 0.03s system 368% cpu 9.315 total
There are a few potential improvements here:
1) Use a lookup table for 'sin' rather than calling 'std::sin'.
2) Tell the compiler what instruction sets to use; for example, tell GCC to use 'skylake' instructions (https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/x86-Options.htm...).
3) Many of the functions could be 'inline constexpr'.
4) Although 'ofs <<' is buffered, it can still be very slow. Build the output in memory and use a lower-level function like 'fwrite' to write it to the file.
5) Use 'std::thread' or 'std::async'. It makes the multi-threading more portable and clearer.
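
Here is a rough sketch of what points 4 and 5 might look like together: render into an in-memory buffer with one 'std::async' task per band of rows, then write the whole image with a single 'fwrite'. It is not the original tinykaboom code. 'Rgb', 'shade_pixel', 'render' and 'write_ppm' are made-up names, and 'shade_pixel' is just a placeholder gradient standing in for the real per-pixel work.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <future>
#include <vector>

// Hypothetical pixel type; not taken from the original program.
struct Rgb { std::uint8_t r, g, b; };

// Placeholder for the real per-pixel shading: a simple gradient.
inline Rgb shade_pixel(int x, int y, int width, int height) {
    return { static_cast<std::uint8_t>(255 * x / std::max(1, width - 1)),
             static_cast<std::uint8_t>(255 * y / std::max(1, height - 1)),
             64 };
}

// Point 5: split the rows into contiguous bands, one std::async task per band.
// Each task writes to a disjoint slice of the buffer, so no locking is needed.
std::vector<std::uint8_t> render(int width, int height, unsigned tasks) {
    std::vector<std::uint8_t> pixels(static_cast<std::size_t>(width) * height * 3);
    const int n = static_cast<int>(std::max(1u, tasks));
    std::vector<std::future<void>> pending;
    for (int t = 0; t < n; ++t) {
        const int y0 = static_cast<int>(static_cast<long long>(height) * t / n);
        const int y1 = static_cast<int>(static_cast<long long>(height) * (t + 1) / n);
        pending.push_back(std::async(std::launch::async,
            [&pixels, width, height, y0, y1] {
                for (int y = y0; y < y1; ++y) {
                    for (int x = 0; x < width; ++x) {
                        const Rgb c = shade_pixel(x, y, width, height);
                        const std::size_t i =
                            (static_cast<std::size_t>(y) * width + x) * 3;
                        pixels[i] = c.r;
                        pixels[i + 1] = c.g;
                        pixels[i + 2] = c.b;
                    }
                }
            }));
    }
    for (auto& f : pending) f.get();  // wait for every band; rethrows task exceptions
    return pixels;
}

// Point 4: write the binary PPM (P6) header, then the whole buffer in one fwrite.
bool write_ppm(const char* path, const std::vector<std::uint8_t>& pixels,
               int width, int height) {
    std::FILE* fp = std::fopen(path, "wb");
    if (!fp) return false;
    std::fprintf(fp, "P6\n%d %d\n255\n", width, height);
    const std::size_t written = std::fwrite(pixels.data(), 1, pixels.size(), fp);
    const bool closed = std::fclose(fp) == 0;
    return closed && written == pixels.size();
}
```

You would call it as something like 'write_ppm("out.ppm", render(640, 480, std::thread::hardware_concurrency()), 640, 480)', with '<thread>' included for 'hardware_concurrency'. With GCC on Linux you will probably need '-pthread' when linking. Whether this actually beats the OpenMP version would have to be measured.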
Weird result. I guess it makes sense that it could use twice as much CPU to finish in half the time but looking at the numbers doesn't feel intuitive.
I wonder how many shaders this would keep busy. There is probably a class of GPUs, and anything above it, that could run this rather well alongside an already large workload.
This technique is used a lot in demoscene demos, which certainly do run in realtime.