| HN Mirror

No, he improved it since the version TonyLobster tested. Quoting:

It's 2023 - is your 16x16=32 bit unsigned multiply under 190 cycles yet? What? No?!? Let me help you out buddy! Much to my surprise, I have surpassed by previous best by 10.5 cycles or 5%, when I thought it couldn't be optimized further. How is this possible you ask? The first trick had to do with how cycles are counted. As a shortcut, we usually average branches/boundary crossings. However, some carries happen only 7% of the time, so optimizing for the average case actually slows you down. You should be optimizing for the no carry case. The other trick was playing with using registers to accumulate results. It wasn't as simple as it sounds; the order of multiplies had to be optimized to leave a target value in the A register.

The goal: multiply the two 16-bit numbers in x0, x1 and y0, y1 (low bytes and high bytes) and return the results in the fastest way possible (which may leave some results in registers, preferably in a useful way like the high bytes). I measure the timing in this convention to expose a timing which could be used in-line, as is often the case for demo effects. In this case, the 3rd byte is returned in A, which could be useful for a multiple accumulate where only the high 16-bits are kept.

The time is 188.1 cycles, averaged over all possible inputs. The (accurate) time of my version on codebase is 198.6

To be clear, z=xy or (z3, z2, z1, z0) = (x1, x0) (y1, y0).

;This section performs the 4 sub-multiplies to form ; the partials to be added later in self-mod code. ;Addresses like "x1y0l+1" refer to low(x1*y0) stored ; to the argument of a later "adc #{value}" ;mult8_set_mier_snippet {multiplier} ;mult8_snippet {multiplicand}, {low result}, {high result} +mult8_set_mier_snippet x1 ;17 +mult8_snippet y0, x1y0l+1, x1y0h+1 ;35 +mult8_snippet y1, x1y1l+1, z3 ;32 +mult8_set_mier_snippet x0 ;17 +mult8_snippet "Y", x0y1l+1, "X" ;28 +mult8_snippet y0, z0, "A" ;28 ;results in X=x0y1h and A=x0y0h ;multiply part total: 149-165, average 157 ;z3 z2 z1 clc x0y1l adc #0 ;x0y0h + x0y1l tay ;6 txa x1y0h adc #0 ;x0y1h + x1y0h tax bcc + ;9 inc z3 clc ;(+6) taken 7% of the time + tya x1y0l adc #0 ;+ x1y0l sta z1 ;7 txa x1y1l adc #0 ;+ x1y1l bcc done ;7 inc z3 ;(+4) taken 42% of the time

Column 1 takes 13 cycles, column 2 takes 16 cycles, and column 3 takes 2.1 cycles. The total time to add the partial products is 29-39 with an average of 31.1. The results are returned as a 32-bit result: z3, A(=z2), z1, z0. The total time is 156.96875+31.1=188.06875 (178-204).