| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by SAI_Peregrinus 1815 days ago

For values that fit in a machine word, there are adder and multiplier designs that make the difference irrelevant. For larger values, or with some other adder/multiplier designs with different trade-offs, LE is dramatically faster.

Specifically, the problem is with "carries". When you're adding (or subtracting, or multiplying, or dividing, I'll just discuss adding) two binary values you might have to carry a 1 to the next place.

If you've got a BE value and an adder stage smaller than that value (say, a 32-bit number and 1-bit adder stages) you have to carry a 1 many times to output the result. If you're receiving the value in BE order 1 bit at a time you can't start the computation until you have the LSBs of both values, since if they're both 1 they'll affect the second bit by their carry. So you're stuck waiting for the entire value to start the computation. Further, during the computation you have to wait for every 1-bit adder in sequence.

There are "fast" adder designs that don't have to wait for every bit, but can instead work on groups of multiple bits with a carry-out at the end of the group. So if you've got an 8-bit group size, you'd have at most 3 carry delays during the computation of the output. For BE, you'd have to wait for all 4 bytes to be received, then wait for all the carry delays. For LE, you can start the computation as soon as the first byte is received, saving some time.

The larger the adder group size the more die area is needed, the stronger the drive strength of the transistors needs to be (bigger fan-out), and the slower the maximum clock of the overall system. On the other hand the bigger the group size the fewer carry delays, so addition can take fewer cycles. Most CPUs and MCUs implement single-cycle addition of their word size. Some CPUs even implement single-cycle multiplication at their word size.

1 comments

monocasa 1815 days ago

On pretty much anything Linux is running on, full words are loaded into registers before a multiply begins. Even for say, something like x86 where a multiply can have a memory argument and say that it straddles a cache line boundary so you could get a portion of the word, the system still splits it into load to (temporary) register, execute mul, and store to memory micro-ops.

link

SAI_Peregrinus 1815 days ago

Correct. There are a few cases where the operands don't fit into a single machine word, the most notable being many cryptographic operations. Particularly RSA and ECC, which involve multiple-precision arithmetic.

There are also non-Linux cases, mostly microcontrollers. EG the Arm Cortex M0 doesn't have a hardware multiplier, the M0+ does.

And then there's that one guy who got Linux running on an 8-bit AVR by emulating a 32-bit ARM and running it on that[1]. I'd consider this a silly edge case. Too fun not to mention though.

[1] https://dmitry.gr/?r=05.Projects&proj=07.%20Linux%20on%208bi...

link