by dastx 1921 days ago
> Next, we have madd x0, x0, x0, x8. madd stands for “multiply-add”: it squares x0, adds x8, and stores the result in x0.

I'm curious (I know very little about assembly or what's on the CPU so pardon me if what I'm asking makes no sense), what's the benefit of having a whole instruction that both multiplies and adds? Is there a logical gate on the processor that does this? Or is this just going through a binary multiplier before going through an adder? Does what I'm asking even make any sense?

9 comments

In general, having a fused instruction is beneficial for performance in that it gives you code size savings which helps with respect to the instruction cache. There are likely other microarchitectural benefits, but that is the obvious one. However, there is a limit to the number of instructions you can support efficiently, so you generally only want to add instructions that will be commonly used.

Multiply-add is a good choice because it corresponds to the relatively common operation of computing the address of a field of a struct in an array so you can operate on that field.

(e.g. &(points[5].x) is (char *)points + (5 * sizeof(point)) + offsetof(point, x)).
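
A minimal sketch of that address arithmetic in C. The `struct point` layout and the helper name `field_x_addr` are assumptions made for illustration; the comment above does not define them.

```c
#include <stddef.h>  /* offsetof, size_t */

/* Hypothetical struct for illustration; the thread does not define it. */
struct point {
    int x;
    int y;
};

/* Address of points[i].x computed by hand: base + i * size + field offset.
   The multiply followed by the add is the part a multiply-add can cover. */
const int *field_x_addr(const struct point *points, size_t i)
{
    const char *base = (const char *)points;
    return (const int *)(base + i * sizeof(struct point)
                              + offsetof(struct point, x));
}
```

For any valid index, `field_x_addr(pts, i)` should compare equal to `&pts[i].x`. Whether a compiler actually emits a multiply-add for this depends on the target and on whether the index is a constant.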

Note also that the 'mul' instruction is described in the Arm docs the article links to as an "alias" of madd. That is, the CPU itself has no pure multiply-only insn at all, only a multiply-and-add. When you write 'mul' in assembly, the assembler turns it into a 'madd' where the register to add is XZR (the reads-as-zero register).

There are a fair number of insns in the A64 instruction set that make use of this trick to provide one flexible instruction that as a special case provides useful simpler functionality under an alias. (Register-to-register 'mov' being an alias of 'orr' is another.)
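
The aliasing can be modelled in a few lines of C. This is only a sketch of the semantics described above; the `model_*` function names are made up here and are not real intrinsics.

```c
#include <stdint.h>

/* Models of the instruction semantics, not real intrinsics.
   All function names are invented for this sketch. */

/* madd Xd, Xn, Xm, Xa  ->  Xd = Xa + Xn * Xm */
uint64_t model_madd(uint64_t n, uint64_t m, uint64_t a)
{
    return a + n * m;
}

/* mul Xd, Xn, Xm is madd with XZR (reads as zero) as the addend. */
uint64_t model_mul(uint64_t n, uint64_t m)
{
    return model_madd(n, m, 0);
}

/* orr Xd, Xn, Xm  ->  Xd = Xn | Xm */
uint64_t model_orr(uint64_t n, uint64_t m)
{
    return n | m;
}

/* Register-to-register mov is orr with XZR as the first operand. */
uint64_t model_mov(uint64_t m)
{
    return model_orr(0, m);
}
```

With this model, the article's `madd x0, x0, x0, x8` corresponds to `model_madd(x0, x0, x8)`.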

RISC-V similarly has a ton of these aliased instructions.
> relatively common operation of computing the address of a field of a struct in an array

This is only relatively common inside loops. Inside loops you will usually index with the loop counter or some other value that is derived from it linearly. Compilers will typically use induction variable arithmetic that doesn't involve multiplication.
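
A rough illustration of the induction-variable point, written by hand in C. The `struct point` layout and both function names are assumptions for this sketch, and a real compiler may or may not perform this transformation on a given loop.

```c
#include <stddef.h>

/* Hypothetical struct for illustration. */
struct point {
    int x;
    int y;
};

/* Indexed form: each access is conceptually
   base + i * sizeof(struct point) + offset of x. */
long sum_x_indexed(const struct point *points, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += points[i].x;
    return sum;
}

/* Strength-reduced form: the pointer itself is the induction variable
   and advances by a constant stride, so no multiply is needed per step. */
long sum_x_pointer(const struct point *points, size_t n)
{
    long sum = 0;
    const struct point *end = points + n;
    for (const struct point *p = points; p != end; p++)
        sum += p->x;
    return sum;
}
```

Both functions return the same sum; the second is roughly what the first turns into once the multiplication by the loop counter is replaced with a running pointer.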

There's a "fused multiply-add" numerical operation defined in IEEE FP that gives the implementation a shortcut compared to two separate instructions: the intermediate product is not rounded, so only the final result is. The resulting extra accuracy can be a good or bad thing depending on whether you prefer reproducible results (vs other expressions of the algorithm) or more accuracy.
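
The rounding difference can be seen with a small C sketch. It does not call a real fused instruction: `fused_like` is a made-up helper that mimics one by doing the float arithmetic in double, which is only equivalent to a true FMA when the double addition happens to be exact (as it is for the inputs mentioned below). C99's `fma` in `<math.h>` is the standard way to request the real operation.

```c
/* Two roundings: the product is rounded to float, then the sum is. */
float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;  /* volatile forces a rounded float value */
    return product + c;
}

/* One rounding at the end. A float*float product is exact in double
   (24 + 24 significand bits fit in 53). The double addition can still
   round in general, so this only mimics a fused operation when that
   sum is exact. */
float fused_like(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = b = 1 + 2^-13 and c = -(1 + 2^-12), the exact product is 1 + 2^-12 + 2^-26. Rounding it to float drops the 2^-26 term, so `mul_then_add` returns 0, while `fused_like` keeps it and returns 2^-26.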
In terms of actual hardware, a general-purpose multiply+adder is actually just a multiplier with one extra row for the addend. NxN multipliers are implemented as an N-row addition (of one multiplicand shifted and masked by the bits of the other). One more row is very cheap compared to running two operations through an ALU that only has separate multiply and add hardware.
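
A software model of that row structure, as a sketch only. Real hardware sums the rows in parallel rather than in a loop, and the function name `madd_rows` is invented for this example.

```c
#include <stdint.h>

/* Model of a 32-row multiplier with one extra row for the addend.
   Each row is `a` shifted by the bit position and masked by that bit of `b`. */
uint32_t madd_rows(uint32_t a, uint32_t b, uint32_t addend)
{
    uint32_t acc = addend;               /* the one extra row */
    for (int bit = 0; bit < 32; bit++) {
        if ((b >> bit) & 1u)             /* mask by this bit of b */
            acc += a << bit;             /* shifted copy of a */
    }
    return acc;                          /* (a * b + addend) mod 2^32 */
}
```

Starting the accumulator at `addend` instead of zero is the whole cost of turning the multiply into a multiply-add in this model.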

In general, outwith ALUs as well as in, it is very cheap to fold any (reasonable) number of additions and subtractions, even ones with constant left/right shifts/rotates to the addends and subtrahends, into multipliers.

Are you using the word "outwith" to be funny, or is this really idiomatic in some dialect? I've seen people using "within and without" to mean "inside of and outside of" but not in anything written in the last hundred years.
The word "outwith" is used in Scotland. I'm not Scottish, but I've heard Scots use the word, and Oxford English Dictionary has recent quotations for it from Scottish newspapers.
Very interesting, thank you.

> Scottish Twitter users 'shocked' after discovering the word 'outwith' is only used in Scotland [0]

[0] https://www.dailyrecord.co.uk/scotland-now/scottish-twitter-...

It's commonly used. A huge number of equations look like (a * b) + (c * d) + ... and so on. So if that's the operation you're doing, it saves an additional instruction and therefore instruction bandwidth and cache. As for actually doing the operation, the extra add is a very small amount of overhead.
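
A sum of products of that shape, sketched in C so that each loop iteration is one multiply followed by one add into the accumulator. The function name `dot` is an assumption for this example, and whether the compiler emits a multiply-add for the loop body depends on the target and optimization level.

```c
#include <stddef.h>

/* a[0]*b[0] + a[1]*b[1] + ... as a chain of multiply-adds. */
long dot(const int *a, const int *b, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (long)a[i] * b[i];   /* acc = acc + a[i] * b[i] */
    return acc;
}
```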

Having looked in the ARM reference manual, the "MUL" instruction is just an alias for MADD with an addition of zero!

I can't find timings for this instruction with 30 seconds of googling, has anyone got a spec with instruction timings?

The Apple M1 can do four fused multiply-adds per cycle with a latency of 4 cycles. Interestingly enough, it seems that the latency on the vector FMA is even lower. So it’s 16 float FMAs per cycle.

Source: https://dougallj.github.io/applecpu/firestorm-simd.html

> Is there a logical gate on the processor that does this?

It’s an ALU, way more complex than a logic gate (of which it’s composed), but yes, fused multiply-add units are standard on every modern CPU. In fact, if your processor is recent (newer than Haswell), odds are good it only has FMA FP ALUs, no pure adder or multiplier.

Pointer arithmetic uses it a lot. For example:

    struct X
    {
        float a;
        int b[10];
    };

    X x;
If you want to access x.b[3], then you have to add sizeof(float) to the address of x, and then add sizeof(int) times 3.
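
The same computation written out in C, as a sketch. The helper name `b_elem_addr` is made up, and `offsetof(struct X, b)` is used in place of `sizeof(float)` so the result stays correct if the compiler inserts padding before `b`.

```c
#include <stddef.h>  /* offsetof, size_t */

struct X {
    float a;
    int b[10];
};

/* Address of x->b[i] by hand: base + offset of b + i * sizeof(int). */
int *b_elem_addr(struct X *x, size_t i)
{
    return (int *)((char *)x + offsetof(struct X, b) + i * sizeof(int));
}
```

For a `struct X x`, `b_elem_addr(&x, 3)` should compare equal to `&x.b[3]`.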
> I'm curious (I know very little about assembly or what's on the CPU so pardon me if what I'm asking makes no sense), what's the benefit of having a whole instruction that both multiplies and adds?

It's still commonly said that RISC processors are faster than CISC because they are "reduced", as in they have fewer instructions. But really it's very beneficial to add instructions that do a lot, if it's something that can easily be done in hardware and replaces several simpler ones.

Multiply-add is an example of one; others are bitfield extraction and rotation, SIMD shuffle, AES encryption, and some of the complex memory operands x86 and ARM have. I even still think x86's memcpy instruction (rep movs) is a good idea.

Here’s the relevant Wikipedia article, which has a decent explanation: https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_op...
x86 also has a kind of specialized (or limited, if you like) fused add and multiply instruction that is used a lot: lea, or load effective address. It's really a fused shift and add, or two fused additions if you prefer. The extent to which this instruction appears in real compiled code should stand as proof of how useful a fused instruction is.
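
A sketch in C of the arithmetic behind the general form `lea dst, [base + index*scale + disp]`. This models only the address computation, and the function name `model_lea` is invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Model of base + index * scale + disp, where scale is limited to
   1, 2, 4 or 8 (which is why it is really a shift, not a full multiply). */
uint64_t model_lea(uint64_t base, uint64_t index, unsigned scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

Compilers often use this form for plain arithmetic as well as addresses: for example, `model_lea(x, x, 4, 0)` computes `x * 5` in one step.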