This isn't actually taking advantage of speculative execution that much. The only speculation here would be in predicting that the loop repeats, which loop unrolling would mostly negate for CPUs that don't do speculative execution.
The data dependency issue, however, would still be a punishing factor. For it not to matter, you'd need a CPU that isn't superscalar. Those do exist but are increasingly uncommon (even 2014's Cortex-M7 was superscalar, although it kinda sounds like ARM backed off on that for later Cortex-Ms?)
Also, many low-end / embedded CPUs that are in-order will still do branch prediction.
Those are also the CPUs where multiplication is most likely to be significantly more expensive, or not implemented in hardware at all (though almost everything has a multiplier these days).
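
To make the data dependency point concrete, here's a rough sketch. This is a hypothetical multiply-and-add hash I made up for illustration (the names `hash_serial` and `hash_pairs` and the constant 31 are my assumptions), not the code under discussion. In the first loop every multiply has to wait on the previous iteration's result; the second computes the same value but gives a superscalar core two independent multiplies per step.

```c
#include <stddef.h>
#include <stdint.h>

/* Serial form: each iteration's multiply depends on the previous h,
   so the loop-carried chain is one multiply per byte. */
uint32_t hash_serial(const uint8_t *p, size_t n) {
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = h * 31u + p[i];
    return h;
}

/* Two bytes per step, using (h*31 + a)*31 + b == h*961 + a*31 + b.
   h*961 and p[i]*31 do not depend on each other, so the loop-carried
   chain drops to one multiply per two bytes. */
uint32_t hash_pairs(const uint8_t *p, size_t n) {
    uint32_t h = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2)
        h = h * 961u + p[i] * 31u + p[i + 1];
    if (i < n)              /* odd length: handle the last byte */
        h = h * 31u + p[i];
    return h;
}
```

On a core that can only issue one instruction at a time, the second version buys you little, and if the multiplier is slow or missing it may well be worse since it does more multiplies overall.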