You can go at least twice as fast if you use symmetries to prune the space. I messed with the C version: $ make
gcc -O3 -march=core2 -msse4.1 -msse4.2 8q_C_bluecalm.c -o 8q_C_bluecalm
time ./8q_C_bluecalm ; echo $?
real 0m0.376s
user 0m0.373s
sys 0m0.001s
92
time ./8q_x64_davidad ; echo $?
real 0m0.129s
user 0m0.128s
sys 0m0.001s
92
Of course this doesn't matter if the point is to compare the best asm dfs to the best C dfs of the same space, but if the point is to make the best asm 8queens solver, you can get down below 6μs like this :)Edit: Using C++ templates to inline the DFS also helps time ./8q_C_bluecalm ; echo $?
real 0m0.260s
user 0m0.258s
sys 0m0.001s
92
Unfortunately this sort of technique doesn't really apply to the asm >_>Edit again: I tried to go a bit further by replacing the board struct with your method of passing 3 bytes between levels of the dfs, but at that point the compiler was able to optimize out the whole program. Removing the templates causes the compiler to actually emit a dfs, which by now is faster than the asm (!). Removing the 2x speedup from just not looking at half of the tree slows us back down to ~1.1x the runtime of the asm: time ./8q_C_bluecalm ; echo $?
real 0m0.140s
user 0m0.138s
sys 0m0.001s
92
Here's this last solution, the one that's ~1.1x slower than the asm on my machine: http://pastie.org/8784768And with the free 2x speedup: http://pastie.org/8784770 One last time now: I tried getting rid of the x arg and using a global. This was much slower, so I tried getting rid of the SOLUTIONS global and using a return value. This caused the compiler to optimize out the whole program again :( |
I implemented your "free 2x speedup" in asm [1]. Sure enough, solution time improved to 4.6us on my machine [2].
[1] https://github.com/davidad/8queens/commit/61119a7f0019e4a85e... [2] https://github.com/davidad/8queens/tree/free-2x-speedup