Anyway, I wonder why it's still so slow.
60-120 cycles sure looks like a CORDIC implementation, but perhaps not.
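
For anyone wondering why that cycle count smells like CORDIC: here's a rough sketch of rotation-mode CORDIC in Python. To be clear, this is only an illustration of the general technique, not a claim about what the hardware actually does. The function name `cordic_sin_cos` and the `iterations` parameter are made up for the example, and it uses floats where real hardware would use fixed-point shifts and adds.

```python
import math

def cordic_sin_cos(theta, iterations=32):
    """Rotation-mode CORDIC sketch; valid for theta in roughly [-pi/2, pi/2]."""
    # Table of arctan(2^-i); hardware would keep this in a small ROM.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2i),
    # so start from the inverse of the accumulated gain.
    k = 1.0
    for i in range(iterations):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = k, 0.0, theta
    for i in range(iterations):
        # Rotate toward z == 0; in hardware the multiply by 2^-i is a shift.
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]

    return y, x  # (sin(theta), cos(theta))

s, c = cordic_sin_cos(0.5)
```

The point is that each pass through the loop buys you roughly one more bit of the result, so the iteration count scales with the precision you want. A loop like that running once per bit of a wide mantissa would land in the same ballpark as the latency above, which is why the number is suggestive, though it doesn't prove anything.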