Presumably there is a pass that turns f(recur(x), y) into recur(f(y, acc), x), where acc is an accumulator argument, after which tail-call optimization can be applied. This works for any associative f.
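To make that concrete, here is a rough sketch of the rewrite done by hand for factorial. The function names are made up for illustration, and this is only a guess at the shape of what such a pass produces, not the compiler's actual output. It assumes n is small enough that the result fits in an int.

```c
/* Plain recursion: the multiply happens after the recursive call returns,
   so the call is not in tail position. */
int fact_rec(int n)
{
    if (n <= 1)
        return 1;
    return n * fact_rec(n - 1);
}

/* Accumulator form (hypothetical hand-written equivalent): the pending
   multiply is folded into acc before the call, so the recursive call is
   the last thing the function does. Call it with acc = 1. */
int fact_acc(int n, int acc)
{
    if (n <= 1)
        return acc;
    return fact_acc(n - 1, n * acc);
}

/* What tail-call elimination can then reduce the accumulator form to. */
int fact_loop(int n)
{
    int acc = 1;
    while (n > 1) {
        acc = n * acc;
        n = n - 1;
    }
    return acc;
}
```

Note that the accumulator version multiplies in a different grouping, ((n * (n-1)) * (n-2)) * ... instead of n * ((n-1) * ((n-2) * ...)), which is why associativity is needed.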
It's just vectorizing the calculation so that it's faster than a purely iterative one. The 64-bit version probably doesn't get this because the optimizer there isn't SSE2-aware yet (just a guess, I don't actually know) and can't do SIMD arithmetic on two 64-bit floats.
(1 * 5 * 9 * 13 * 17 * ...) * (2 * 6 * 10 * 14 * 18 * ...) * (3 * 7 * 11 * 15 * 19 * ...) * (4 * 8 * 12 * 16 * 20 * ...)
(it's not precisely that, but close enough)
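A scalar sketch of that regrouping, written by hand, might look like the following. The names fact_scalar and fact_lanes are hypothetical, and unsigned is used here (an assumption on my part) so that wraparound is well defined and both versions agree for every n. Real vectorized output would keep the four lanes in one SIMD register rather than an array.

```c
/* Scalar reference: one long dependency chain of multiplies. */
unsigned fact_scalar(unsigned n)
{
    unsigned r = 1;
    for (unsigned i = 2; i <= n; i++)
        r *= i;
    return r;
}

/* Hypothetical hand-written version of the regrouping: four independent
   partial products with stride 4, combined once at the end. */
unsigned fact_lanes(unsigned n)
{
    unsigned lane[4] = {1, 1, 1, 1};
    unsigned i = 1;

    /* Main loop: lane 0 gets 1 * 5 * 9 * ..., lane 1 gets 2 * 6 * 10 * ...,
       and so on, as long as a full group of four factors remains. */
    for (; i + 3 <= n; i += 4) {
        lane[0] *= i;
        lane[1] *= i + 1;
        lane[2] *= i + 2;
        lane[3] *= i + 3;
    }

    /* Horizontal reduction of the four lanes. */
    unsigned r = (lane[0] * lane[1]) * (lane[2] * lane[3]);

    /* Scalar tail for the leftover 0 to 3 factors. */
    for (; i <= n; i++)
        r *= i;
    return r;
}
```

The reordering is only valid because the multiplication is associative and commutative, which holds for unsigned arithmetic modulo 2^32.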
It isn't optimal, however: optimal code would just use precomputed results (signed integer overflow is undefined, so with a 32-bit int only n <= 12 is defined)
(if signed integer overflow were defined as two's complement wraparound, you could still use a table, since 34! is a multiple of 2^32 and so every n > 33 gives 0)
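For the defined range, a table version could be as simple as the sketch below. fact_table and fact_lookup are made-up names, int is assumed to be 32 bits, and returning -1 for out-of-range input is just one arbitrary choice.

```c
/* Precomputed 0! .. 12!. 12! = 479001600 is the largest factorial
   that fits in a 32-bit int. */
static const int fact_table[13] = {
    1, 1, 2, 6, 24, 120, 720, 5040, 40320,
    362880, 3628800, 39916800, 479001600
};

/* Hypothetical helper: returns n! for 0 <= n <= 12, and -1 for any
   other n, where the true result is not representable. */
int fact_lookup(int n)
{
    if (n < 0 || n > 12)
        return -1;
    return fact_table[n];
}
```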