by eejjjj82 1816 days ago
Best reply from Linus thus far

  ...
  Also, why is this in big-endian order?
  
  Let's just try to kill big-endian data, it's disgusting 
  and should just die already.

  BE is practically dead anyway, we shouldn't add new cases. Networking
  has legacy reasons from the bad old days when byte order wars were
  still a thing, but those days are gone.
https://lore.kernel.org/lkml/CAHk-=wisMFiBHT7dLFOtHqX=fEve3J...

Is there a technical reason why little endian is better besides the fact that it's more popular due to x86/ARM?

To me, big endian makes a lot more sense - integers are stored in the order which you read them (mentally).

I do remember reading about how little endian enabled some type of optimization inside the CPU, but I forget the specifics.

For values that fit in a machine word, there are adder and multiplier designs that make the difference irrelevant. For larger values, or with some other adder/multiplier designs with different trade-offs, LE is dramatically faster.

Specifically, the problem is with "carries". When you're adding (or subtracting, or multiplying, or dividing, I'll just discuss adding) two binary values you might have to carry a 1 to the next place.

If you've got a BE value and an adder stage smaller than that value (say, a 32-bit number and 1-bit adder stages) you have to carry a 1 many times to output the result. If you're receiving the value in BE order 1 bit at a time you can't start the computation until you have the LSBs of both values, since if they're both 1 they'll affect the second bit by their carry. So you're stuck waiting for the entire value to start the computation. Further, during the computation you have to wait for every 1-bit adder in sequence.

There are "fast" adder designs that don't have to wait for every bit, but can instead work on groups of multiple bits with a carry-out at the end of the group. So if you've got an 8-bit group size, you'd have at most 3 carry delays during the computation of the output. For BE, you'd have to wait for all 4 bytes to be received, then wait for all the carry delays. For LE, you can start the computation as soon as the first byte is received, saving some time.

The larger the adder group size the more die area is needed, the stronger the drive strength of the transistors needs to be (bigger fan-out), and the slower the maximum clock of the overall system. On the other hand the bigger the group size the fewer carry delays, so addition can take fewer cycles. Most CPUs and MCUs implement single-cycle addition of their word size. Some CPUs even implement single-cycle multiplication at their word size.

On pretty much anything Linux runs on, full words are loaded into registers before a multiply begins. Even on something like x86, where a multiply can take a memory operand that might straddle a cache-line boundary (so only part of the word is available at first), the system still splits it into micro-ops: load to a (temporary) register, execute the mul, and store to memory.
Correct. There are a few cases where the operands don't fit into a single machine word, most notably many cryptographic operations, particularly RSA and ECC, which involve multiple-precision arithmetic.

There are also non-Linux cases, mostly microcontrollers. E.g. the Arm Cortex-M0 doesn't have a hardware multiplier; the M0+ does.

And then there's that one guy who got Linux running on an 8-bit AVR by emulating a 32-bit ARM and running it on that[1]. I'd consider this a silly edge case. Too fun not to mention though.

[1] https://dmitry.gr/?r=05.Projects&proj=07.%20Linux%20on%208bi...

When you add numbers together you start at the little end. Imagine a bignum implementation with multi-word numbers. Now imagine you have a bunch of them in a file you want to add up. If the numbers are in little endian order you can do a streaming implementation that reads and adds at the same time. If they are in big endian order you need to read a whole number before it can be added to the accumulator.
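A minimal sketch of the streaming idea, assuming two equal-length little-endian byte streams rather than a file of bignums (the function name `add_le_stream` is made up for illustration). Each output byte is produced as soon as the matching input bytes arrive, because the carry only ever flows toward bytes that come later:

```python
from typing import Iterable, Iterator


def add_le_stream(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Add two equal-length little-endian byte streams.

    Result bytes are yielded least significant first, one per input
    pair, so nothing has to be buffered.
    """
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry
        yield total & 0xFF   # low 8 bits are final right away
        carry = total >> 8   # carry moves on to the next (more significant) byte
    if carry:
        yield carry          # final carry-out becomes an extra top byte


a = (0x01FF).to_bytes(2, "little")
b = (0x0001).to_bytes(2, "little")
result = bytes(add_le_stream(a, b))
print(hex(int.from_bytes(result, "little")))
```

With big-endian input the same loop cannot emit its first output byte until the last input byte has been seen, since a carry from the low end can still change it.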

Obviously this example is very contrived. This sort of thing was much more of a concern on 8-bit computers. But little endian still seems more natural to me.

That makes sense. It would also help from a cache prefetching perspective.
With little endian, if you take a pointer to an integer of a large type (e.g. uint64_t) where the value fits in a smaller type (e.g. uint32_t), you will get the same, correct value whether you access it as a uint64_t or a uint32_t. This can make integer type conversion/casting slightly more efficient, and simplify code a bit.
One small argument for little endian is that if you have a void* pointer, then in little endian format you can interpret it as an int8_t*, int16_t*, etc. (or char*, short*, etc. in old money) and get the same numerical value if the number is small enough that it fits into all the types you try. I don't think that has much practical use but it does have a nice feel about it.
Sounds like a footgun to me
It means for example that a bitmap is the same no matter if its code accesses it in groups of 8/16/32/64 bits.
I get it, but it also means values will appear to be correct, instead of obviously wrong, if the data is cast without concern for the value range. Then one day someone enters a value and exceeds that range and BOOM
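Both sides of this sub-thread can be sketched with Python's `struct` module, which stands in here for the C pointer casts (the helper `read_prefix` is hypothetical; it reads the leading bytes of a buffer as a narrower integer, which is what dereferencing a narrower pointer at the same address does):

```python
import struct


def read_prefix(buf: bytes, fmt: str) -> int:
    """Interpret the first bytes of buf as the integer type given by fmt."""
    size = struct.calcsize(fmt)
    return struct.unpack(fmt, buf[:size])[0]


small_le = struct.pack("<Q", 42)   # 64-bit little-endian
small_be = struct.pack(">Q", 42)   # 64-bit big-endian

# Little endian: every narrower read at the same address sees 42.
print([read_prefix(small_le, f) for f in ("<B", "<H", "<I", "<Q")])

# Big endian: a 32-bit read at the same address sees only the zero high half.
print(read_prefix(small_be, ">I"))

# The footgun: once the value outgrows the narrow type, the little-endian
# read silently returns the low bits instead of failing.
big_le = struct.pack("<Q", 2**32 + 7)
print(read_prefix(big_le, "<I"))
```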
>> Is there a technical reason why little endian is better besides the fact that it's more popular due to x86/ARM? To me, big endian makes a lot more sense

Some of the other replies have minor technical reasons, but I've always preferred big endian for the readability. Having said that, I'm happy to part with the idea of big endian if it means an end to having 2 options to worry about. One thing that bothers me a lot about RISC-V is that the standard claims to allow either big or little endian implementations. Little has won and nothing new should support big endian IMHO. The benefits of either are largely irrelevant, but the existence of both is a problem. Or maybe it's that the existence of code that cares is the real problem ;-)

Besides what others have said, little endian is more natural: in an n-byte integer, the byte at offset b has place value 256**b, instead of 256**(n - b - 1).
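Written out as code, the two place-value rules look like this (a small sketch; `le_value` and `be_value` are illustrative names, and `int.from_bytes` is the standard-library equivalent):

```python
def le_value(data: bytes) -> int:
    """Little endian: the byte at offset b is weighted by 256**b."""
    return sum(byte * 256**b for b, byte in enumerate(data))


def be_value(data: bytes) -> int:
    """Big endian: the weight depends on the total length n as well."""
    n = len(data)
    return sum(byte * 256**(n - b - 1) for b, byte in enumerate(data))


print(hex(le_value(bytes([0x78, 0x56, 0x34, 0x12]))))
print(hex(be_value(bytes([0x12, 0x34, 0x56, 0x78]))))
```

The little-endian version never needs to know `n`, which is the point being made.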
I agree that big endian makes more sense, but I think that particular ship has long since sailed.
Also note that that was the first reply, minutes after the patches were posted.

I'm half convinced that he was racing to be first to give it a seal of partial-approval to keep comments more on track.

If that’s true, it would be extremely good community management on his part.
> Networking has legacy reasons from the bad old days when byte order wars were still a thing, but those days are gone.

Yeah, this isn't true. Low level hardware receiving data still likes to use shift registers:

    * Zero the shift register
    * Clock in one byte, shift it into the shift register
    * Clock in the next byte, shift it into the shift register, shifting the previous byte left by one byte
    * Repeat for as many bytes as you have
If you want this to work for a variable number of bytes, then you need most significant byte first, so that everything more significant is pre-zeroed. This is not theoretical - we did this for an FPGA network offload thing last year.
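A software model of that receive loop, as a rough sketch only: the register width and the name `shift_in_msb_first` are assumptions, and real hardware would do this in logic rather than code.

```python
def shift_in_msb_first(data: bytes, width_bytes: int = 4) -> int:
    """Model a fixed-width shift register fed one byte per clock."""
    mask = (1 << (8 * width_bytes)) - 1
    reg = 0                                # zero the shift register
    for byte in data:                      # one byte per clock
        reg = ((reg << 8) | byte) & mask   # older bytes move left by one byte
    return reg


# Most significant byte first: any number of bytes gives the right value,
# because the untouched upper part of the register is already zero.
print(hex(shift_in_msb_first(b"\x12")))
print(hex(shift_in_msb_first(b"\x12\x34")))

# Least significant byte first: the same loop produces the bytes reversed,
# so the receiver would need to know the length up front to place them.
print(hex(shift_in_msb_first(b"\x34\x12")))
```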

Processors can't agree on endianness, but network protocols have. "Network byte order" is a standard thing that is almost completely universal across communication protocols.
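For reference, a short sketch of what network byte order looks like on the wire, using the `!` (network, i.e. big-endian) prefix of Python's `struct` module. The port and address values are arbitrary examples:

```python
import struct

port = 443
addr = 0xC0000201  # 192.0.2.1 written as a 32-bit integer

# "!" selects network byte order with no padding: 2 bytes, then 4 bytes.
wire = struct.pack("!HI", port, addr)
print(wire.hex())

# The receiver unpacks with the same format, whatever its native byte order.
print(struct.unpack("!HI", wire))
```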

This kind of misses the point of his comment though. There are now tons of ASICs and FPGAs in networking gear that rely on network byte order for optimal performance. No one's advocating for changing that.

What he is saying is that for pretty much everything else (read: typical CPUs) it's all little endian now.

I am out of my depth here, but I do not really understand why this matters. Could you elaborate?

Basically, I understand that shifting in incoming serial bits is very convenient. But as long as you specify the "word size" used in your local memory, I do not see the problem: just use 8 (for instance) shifts for your 8-bit word and then go to the next memory location. Why is this fantasy wrong?

>BE is practically dead anyway.

It is the default byte order in Java bytecode.
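As a concrete illustration, the header of a Java class file stores its multi-byte fields big-endian: a 4-byte magic number followed by 2-byte minor and major versions. A sketch in Python, where the header bytes are a hand-written example (major version 52, i.e. Java 8) and not read from a real file:

```python
import struct

# First 8 bytes of a class file: magic, minor_version, major_version.
header = bytes.fromhex("cafebabe00000034")

magic, minor, major = struct.unpack(">IHH", header)   # ">" = big-endian
print(hex(magic), minor, major)

# Reading the same bytes as little-endian scrambles the magic number.
wrong_magic = struct.unpack("<IHH", header)[0]
print(hex(wrong_magic))
```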