Hacker News new | ask | show | jobs
by anonymoushn 4494 days ago
You can go at least twice as fast if you use symmetries to prune the space. I messed with the C version:

  $ make
  gcc -O3 -march=core2 -msse4.1 -msse4.2 8q_C_bluecalm.c -o 8q_C_bluecalm
  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.376s
  user	0m0.373s
  sys	0m0.001s
  92
  time ./8q_x64_davidad ; echo $?
  
  real	0m0.129s
  user	0m0.128s
  sys	0m0.001s
  92
Of course this doesn't matter if the point is to compare the best asm dfs to the best C dfs of the same space, but if the point is to make the best asm 8queens solver, you can get down below 6μs like this :)

Edit: Using C++ templates to inline the DFS also helps

  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.260s
  user	0m0.258s
  sys	0m0.001s
  92
Unfortunately this sort of technique doesn't really apply to the asm >_>

Edit again: I tried to go a bit further by replacing the board struct with your method of passing 3 bytes between levels of the dfs, but at that point the compiler was able to optimize out the whole program.

Removing the templates causes the compiler to actually emit a dfs, which by now is faster than the asm (!). Removing the 2x speedup from just not looking at half of the tree slows us back down to ~1.1x the runtime of the asm:

  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.140s
  user	0m0.138s
  sys	0m0.001s
  92
Here's this last solution, the one that's ~1.1x slower than the asm on my machine: http://pastie.org/8784768

And with the free 2x speedup: http://pastie.org/8784770

One last time now: I tried getting rid of the x arg and using a global. This was much slower, so I tried getting rid of the SOLUTIONS global and using a return value. This caused the compiler to optimize out the whole program again :(

2 comments

Wow! I like your C solution. I'm glad that at least the speedup is still measurable, but it's a close shave.

I implemented your "free 2x speedup" in asm [1]. Sure enough, solution time improved to 4.6us on my machine [2].

[1] https://github.com/davidad/8queens/commit/61119a7f0019e4a85e... [2] https://github.com/davidad/8queens/tree/free-2x-speedup

It seems to me that using a bitboard instead of the struct is the biggest gain. I like your version, very elegant and terse.