| HN Mirror



	by dastx 1921 days ago
	> Next, we have madd x0, x0, x0, x8. madd stands for “multiply-add”: it squares x0, adds x8, and stores the result in x0.

	I'm curious (I know very little about assembly or what's on the CPU so pardon me if what I'm asking makes no sense), what's the benefit of having a whole instruction that both multiplies and adds? Is there a logical gate on the processor that does this? Or is this just going through a binary multiplier before going through an adder? Does what I'm asking even make any sense?

9 comments

Veserv 1921 days ago

In general, having a fused instruction is beneficial for performance in that it gives you code size savings which helps with respect to the instruction cache. There are likely other microarchitectural benefits, but that is the obvious one. However, there is a limit to the number of instructions you can support efficiently, so you generally only want to add instructions that will be commonly used.

Multiply-add is a good choice because it corresponds to the relatively common operation of computing the address of a field of a struct in an array so you can operate on that field.

(e.g. &(points[5].x) is, in byte arithmetic, (char *)points + (5 * sizeof(point)) + offsetof(point, x)).
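A minimal C sketch of that address computation, for anyone who wants to see it spelled out. The struct layout and the function name field_x_address are made up for illustration; only the formula comes from the comment above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct, chosen so that x is not at offset 0. */
struct point {
    int32_t id;
    int32_t x;
    int32_t y;
};

/* Address of points[i].x computed by hand, in byte arithmetic:
   base + i * sizeof(struct point) + offsetof(struct point, x).
   The multiply followed by an add is the multiply-add shape. */
const int32_t *field_x_address(const struct point *points, size_t i)
{
    assert(points != NULL);
    const unsigned char *base = (const unsigned char *)points;
    return (const int32_t *)(base + i * sizeof(struct point)
                                  + offsetof(struct point, x));
}
```

For any valid index i, field_x_address(points, i) should compare equal to &points[i].x.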

pm215 1921 days ago

Note also that the 'mul' instruction is described in the Arm docs the article links to as an "alias" of madd. That is, the CPU itself has no pure multiply-only insn at all, only a multiply-and-add. When you write 'mul' in assembly, the assembler turns it into a 'madd' where the register to add is XZR (the reads-as-zero register).

There are a fair number of insns in the A64 instruction set that make use of this trick to provide one flexible instruction that as a special case provides useful simpler functionality under an alias. (Register-to-register 'mov' being an alias of 'orr' is another.)
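As a rough software model of the aliasing (not real assembler output), the two aliases mentioned above can be written as C functions. The function names mirror the mnemonics; XZR here is just a constant zero standing in for the zero register.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the reads-as-zero register. */
static const uint64_t XZR = 0;

/* Model of "madd xd, xn, xm, xa": xd = xa + xn * xm (wraps mod 2^64). */
uint64_t madd(uint64_t xn, uint64_t xm, uint64_t xa)
{
    return xa + xn * xm;
}

/* Model of "orr xd, xn, xm": bitwise OR of two registers. */
uint64_t orr(uint64_t xn, uint64_t xm)
{
    return xn | xm;
}

/* "mul xd, xn, xm" is the alias "madd xd, xn, xm, xzr". */
uint64_t mul(uint64_t xn, uint64_t xm)
{
    return madd(xn, xm, XZR);
}

/* Register-to-register "mov xd, xm" is the alias "orr xd, xzr, xm". */
uint64_t mov(uint64_t xm)
{
    return orr(XZR, xm);
}
```

With this model, madd(3, 3, 8) gives 17, which matches the "squares x0, adds x8" reading of the instruction in the quoted article when x0 is 3 and x8 is 8.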

Narishma 1918 days ago

RISC-V similarly has a ton of these aliased instructions.

tom_mellior 1921 days ago

> relatively common operation of computing the address of a field of a struct in an array

This is only relatively common inside loops. Inside loops you will usually index with the loop counter or some other value that is derived from it linearly. Compilers will typically use induction variable arithmetic that doesn't involve multiplication.
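A hand-written sketch of what that induction variable rewrite amounts to. The struct and function names are hypothetical, and a real compiler does this transformation internally rather than at the source level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct for illustration. */
struct point {
    int id;
    int x;
    int y;
};

/* Indexed form: each access is conceptually base + i * sizeof + offset. */
long sum_x_indexed(const struct point *points, size_t n)
{
    assert(points != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += points[i].x;
    return sum;
}

/* Strength-reduced form: the pointer itself is the induction variable and
   is bumped by a constant stride, so no multiplication is needed. */
long sum_x_strength_reduced(const struct point *points, size_t n)
{
    assert(points != NULL || n == 0);
    long sum = 0;
    const struct point *end = points + n;
    for (const struct point *p = points; p != end; p++)
        sum += p->x;
    return sum;
}
```

Both functions return the same sum; the second one only ever adds sizeof(struct point) to a pointer.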

fulafel 1921 days ago

There's a "fused multiply-add" numerical operation defined in IEEE FP that gives the implementation a shortcut compared to the two separate instructions. It doesn't apply the extra rounding to the intermediate result. The resulting extra accuracy can be a good or bad thing depending on whether you prefer reproducible results (vs other expressions of the algorithm) or more accuracy.

JdeBP 1921 days ago

In terms of actual hardware, a general-purpose multiply+adder is actually just a multiplier with one extra row for the addend. NxN multipliers are implemented as an N-row addition (of one multiplicand shifted and masked by the bits of the other). One more row is very cheap compared to running two operations through an ALU that only has separate multiply and add hardware.

In general, outwith ALUs as well as in, it is very cheap to fold any (reasonable) number of additions and subtractions, even ones with constant left/right shifts/rotates to the addends and subtrahends, into multipliers.
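To make the "one extra row" picture concrete, here is a software sketch of a 32x32 multiplier as a sum of rows, with the addend as the extra row. It is a sequential loop, whereas hardware would sum the rows in parallel with an adder tree; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a * b + c (mod 2^64) by summing the rows of a multiplier array. */
uint64_t multiply_add_rows(uint32_t a, uint32_t b, uint64_t c)
{
    uint64_t sum = c;                      /* the one extra row: the addend */
    for (unsigned row = 0; row < 32; row++) {
        if ((b >> row) & 1u)               /* row is masked by bit `row` of b */
            sum += (uint64_t)a << row;     /* multiplicand shifted into place */
    }
    return sum;
}
```

A plain multiply is the same loop with c set to 0, so the addend costs one more term in the sum rather than a second pass through the ALU.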

Y_Y 1921 days ago

Are you using the word "outwith" to be funny, or is this really idiomatic in some dialect? I've seen people using "within and without" to mean "inside of and outside of" but not in anything written in the last hundred years.

bloak 1921 days ago

The word "outwith" is used in Scotland. I'm not Scottish, but I've heard Scots use the word, and Oxford English Dictionary has recent quotations for it from Scottish newspapers.

Y_Y 1921 days ago

Very interesting, thank you.

> Scottish Twitter users 'shocked' after discovering the word 'outwith' is only used in Scotland [0]

[0] https://www.dailyrecord.co.uk/scotland-now/scottish-twitter-...

JdeBP 1921 days ago

And Hacker News within just the past 5 days. (-:

* https://news.ycombinator.com/item?id=26438656

* https://news.ycombinator.com/item?id=26399177

pjc50 1921 days ago

It's commonly used. There's a huge number of equations that look like (a * b) + (c * d) + ... and so on. So if that's the operation you're doing, it saves an additional instruction and therefore instruction bandwidth and cache. In the hardware that actually does the operation, the extra add is a very small amount of overhead.
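A sum of products written as a C loop, as a sketch of why the pattern maps so well onto multiply-add. The function name is made up, and whether a given compiler actually emits one multiply-add per iteration is an assumption that depends on the target and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes a[0]*b[0] + a[1]*b[1] + ... + a[n-1]*b[n-1]. */
int64_t sum_of_products(const int64_t *a, const int64_t *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = acc + a[i] * b[i];   /* one multiply-add step per term */
    return acc;
}
```

Each loop iteration has exactly the shape acc = acc + x * y, which is what a single multiply-add instruction computes.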

Having looked in the ARM reference manual, the "MUL" instruction is just an alias for MADD with an addition of zero!

I can't find timings for this instruction with 30 seconds of googling. Has anyone got a spec with instruction timings?

ribit 1921 days ago

The Apple M1 can do four fused multiply-adds per cycle with a latency of 4 cycles. Interestingly enough, it seems that the latency on the vector FMA is even lower. So it’s 16 float FMAs per cycle (four vector FMAs of four floats each).

Source: https://dougallj.github.io/applecpu/firestorm-simd.html

masklinn 1921 days ago

> Is there a logical gate on the processor that does this?

It’s an ALU, way more complex than a logic gate (of which it’s composed), but yes, fused multiply-add units are standard on every modern CPU. In fact, if your processor is recent (more so than Haswell), odds are good it only has FMA FP ALUs, no pure adder or multiplier.

amelius 1921 days ago

Pointer arithmetic uses it a lot. For example:

    struct X
    {
        float a;
        int b[10];
    };

    X x;

If you want to access x.b[3], then you have to add the offset of b (sizeof(float), assuming no padding) to the address of x, and then add sizeof(int) times 3.
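The same computation as a C sketch, using offsetof instead of assuming the offset of b equals sizeof(float) (it does on common ABIs where both types are 4 bytes, but that is an assumption). The helper name is hypothetical, and the struct is spelled "struct X" so that it compiles as C.

```c
#include <assert.h>
#include <stddef.h>

struct X {
    float a;
    int b[10];
};

/* Address of x->b[i] computed by hand:
   base + offsetof(struct X, b) + i * sizeof(int). */
int *address_of_b_element(struct X *x, size_t i)
{
    assert(x != NULL && i < 10);
    unsigned char *base = (unsigned char *)x;
    return (int *)(base + offsetof(struct X, b) + i * sizeof(int));
}
```

For a struct X named x, address_of_b_element(&x, 3) should compare equal to &x.b[3].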

astrange 1921 days ago

> I'm curious (I know very little about assembly or what's on the CPU so pardon me if what I'm asking makes no sense), what's the benefit of having a whole instruction that both multiplies and adds?

It's still commonly said that RISC processors are faster than CISC because they are "reduced", as in they have fewer instructions. But really it's very beneficial to add instructions that do a lot, if it's something that can easily be done in hardware and replaces several simpler ones.

Multiply-add is an example of one; others are bitfield extraction and rotation, SIMD shuffle, AES encryption, and some of the complex memory operands x86 and ARM have. I even still think x86's memcpy instruction is a good idea.

alexhutcheson 1921 days ago

Here’s the relevant Wikipedia article, which has a decent explanation: https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_op...

jeffbee 1921 days ago

x86 also has a kind of specialized (or limited, if you like) fused add and multiply instruction that is used a lot: lea, or load effective address. It's really a fused shift and add, or two fused additions if you prefer. The extent to which this instruction appears in real compiled code should stand as proof for how useful a fused instruction is.
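A rough C model of what lea computes in its base + index*scale + displacement form. This is only the arithmetic, not the instruction encoding; the function signature is made up, and the scale is restricted to 1, 2, 4, or 8 as in x86 addressing modes.

```c
#include <assert.h>
#include <stdint.h>

/* Model of x86-64 "lea dst, [base + index*scale + disp]".
   The result wraps mod 2^64, like 64-bit address arithmetic. */
uint64_t lea(uint64_t base, uint64_t index, unsigned scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    /* disp is sign-extended to 64 bits before being added. */
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

Because the scale is a power of two, the multiply is really a shift, which is why lea reads as a fused shift and add. Compilers also use it for plain arithmetic: lea(x, x, 4, 0) computes 5 * x.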