| HN Mirror


by nyrikki 83 days ago

The blog post linked in the OP explains this better, IMHO [0]:

   If the data stream encodes values with byte order B, then the algorithm to decode the value on computer with byte order C should be about B, not about the relationship between B and C.

One cannot just ignore the big/little-endian data interchange problem: MacOS [1], Java, TCP/IP, JPEG, etc.

The point (for me) is not that your code runs on an s390; it is that you abstract your personal local implementation details away from the data interchange formats. And unfortunately almost all processors are little-endian, while many of the popular and unavoidable external formats are big-endian.

[0] https://commandcenter.blogspot.com/2012/04/byte-order-fallac...
[1] https://github.com/apple/darwin-xnu/blob/main/EXTERNAL_HEADE...

2 comments

adrian_b 83 days ago

To cope with data interchange formats, you need a set of big-endian data types: each signed or unsigned integer type of 16 bits or more must have a big-endian variant, identified e.g. by a "_be" suffix.

Most CPUs (including x86-64) have variants of the load and store instructions that reverse the byte order (e.g. MOVBE in x86-64). The remaining CPUs have byte reversal instructions for registers, so a reversed byte order load or store can be simulated by a sequence of 2 instructions.

So the little-endian types and the big-endian data types must be handled identically by a compiler, except that the load and store instructions use different encodings.

The structures used in a data-exchange format must be declared with the correct types and that should take care of everything.
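As a rough illustration of such user-defined types in C, a big-endian 32-bit integer can be wrapped in a struct that stores the bytes in wire order. This is only a sketch: the names `u32_be`, `u32_be_get`, `u32_be_make` and `packet_header` are hypothetical, and C offers no way to make the conversion implicit, so the accessors have to be called by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical big-endian 32-bit type: the bytes are stored exactly as they
   appear on the wire, most significant byte first. */
typedef struct { uint8_t bytes[4]; } u32_be;

/* Convert a wire-format value to a native uint32_t. */
static inline uint32_t u32_be_get(u32_be v) {
    return ((uint32_t)v.bytes[0] << 24) | ((uint32_t)v.bytes[1] << 16) |
           ((uint32_t)v.bytes[2] << 8)  |  (uint32_t)v.bytes[3];
}

/* Convert a native uint32_t to its wire-format representation. */
static inline u32_be u32_be_make(uint32_t x) {
    u32_be v = {{ (uint8_t)(x >> 24), (uint8_t)(x >> 16),
                  (uint8_t)(x >> 8),  (uint8_t)x }};
    return v;
}

/* Hypothetical wire header declared with explicit-endianness fields.
   The fields are plain byte arrays, so compilers normally add no padding;
   the static assertion checks that assumption at compile time. */
struct packet_header {
    u32_be length;
    u32_be sequence;
};
static_assert(sizeof(struct packet_header) == 8, "unexpected padding");
```

Code that uses `struct packet_header` compiles to the same behavior on little-endian and big-endian targets, because nothing in it depends on the host byte order.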

Any decent programming language must provide means for the user to define such data types, when they are not provided by the base language.

The traditional UNIX conversion functions are the wrong way to handle endianness differences. An optimizing compiler must be able to recognize them as special cases in order to be able to optimize them away from the machine code.
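For context, the "traditional UNIX conversion functions" are presumably the POSIX `htonl`/`ntohl` family from `<arpa/inet.h>`. A minimal sketch of that style follows; `read_network_u32` is a hypothetical helper name, not something from the comment.

```c
#include <arpa/inet.h>  /* ntohl (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit network-order (big-endian) value from a byte buffer the
   traditional way: copy the raw bytes into a native integer, then convert. */
uint32_t read_network_u32(const uint8_t *buf) {
    assert(buf != NULL);
    uint32_t raw;
    memcpy(&raw, buf, sizeof raw);  /* avoids unaligned and aliasing problems */
    return ntohl(raw);              /* identity on big-endian hosts, swap on little */
}
```

Here the conversion is a separate call on a value whose type does not record its byte order, so forgetting the call, or applying it twice, still compiles.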

A program that is written using only data types with known endianness can be compiled for either little-endian targets or big-endian targets and it will work identically.

All the problems that have ever existed in handling endianness have been caused by programming languages where the endianness of the base data types was left undefined, for fear that recompiling a program for a target of different endianness could result in a slower program.

This fear is obsolete today.


cv5005 82 days ago

Having different types seems wrong to me because endianness issues disappear after deserialization, so it would make more sense to slap an annotation on the data field so that only the serializer needs to know how to load/store it.
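C has no field annotations, but the idea can be approximated with a descriptor table that only the serializer reads. This is a sketch under assumptions: `struct record`, `field_desc`, `record_fields` and `record_decode` are hypothetical names, and every field is assumed to be 32 bits wide and laid out back to back on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum byte_order { ORDER_BIG, ORDER_LITTLE };

/* The "annotation": where a 32-bit field lives in the native struct and how
   it is encoded on the wire. */
struct field_desc {
    size_t offset;
    enum byte_order order;
};

/* In-memory record uses plain native integers. */
struct record {
    uint32_t id;
    uint32_t size;
};

static const struct field_desc record_fields[] = {
    { offsetof(struct record, id),   ORDER_BIG },
    { offsetof(struct record, size), ORDER_LITTLE },
};

/* Decode consecutive 4-byte wire fields into the native struct. */
void record_decode(struct record *out, const uint8_t *wire) {
    assert(out != NULL && wire != NULL);
    size_t count = sizeof record_fields / sizeof record_fields[0];
    for (size_t i = 0; i < count; i++) {
        const uint8_t *p = wire + 4 * i;
        uint32_t v;
        if (record_fields[i].order == ORDER_BIG) {
            v = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
        } else {
            v = ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
                ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
        }
        memcpy((char *)out + record_fields[i].offset, &v, sizeof v);
    }
}
```

The rest of the program only ever sees `struct record` with native integers; byte order is mentioned in exactly one table.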


josephg 80 days ago

Nah, that's a terrible way to handle endian-ness. Your "big endian" types infect your entire program. And you pay a cost with every computation you do with them.

Just treat the data on disk / on the wire as if it were in some encoded format. Parse on load. Encode back out to the expected format when you save it. Within your program, just use your language's native int formats.

For example, in C I use something like this:

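    #include <stdint.h>

    /* Decode a big-endian 32-bit value; the result is the same on any host. */
    uint32_t read_be_u32(const uint8_t data[4]) {
        return ((uint32_t)data[0] << 24) |
               ((uint32_t)data[1] << 16) |
               ((uint32_t)data[2] << 8)  |
               ((uint32_t)data[3]);
    }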

... And the equivalent for little endian data. Modern optimizers will happily turn that into the right instructions - either a noop or bswap - as appropriate depending on the target architecture.
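A minimal sketch of the little-endian counterpart, plus an encoder for the save path, could look like the following. The names `read_le_u32`, `write_be_u32` and `roundtrip_example` are illustrative and follow the naming pattern used in this comment; they are not taken from the linked Godbolt snippet.

```c
#include <assert.h>
#include <stdint.h>

/* Little-endian decode: byte 0 is the least significant byte. */
uint32_t read_le_u32(const uint8_t data[4]) {
    return ((uint32_t)data[0])       |
           ((uint32_t)data[1] << 8)  |
           ((uint32_t)data[2] << 16) |
           ((uint32_t)data[3] << 24);
}

/* Big-endian encode: write the most significant byte first. */
void write_be_u32(uint8_t out[4], uint32_t value) {
    out[0] = (uint8_t)(value >> 24);
    out[1] = (uint8_t)(value >> 16);
    out[2] = (uint8_t)(value >> 8);
    out[3] = (uint8_t)value;
}

/* Usage example: encode as big-endian, then inspect the bytes. */
void roundtrip_example(void) {
    uint8_t buf[4];
    write_be_u32(buf, 0x12345678u);
    assert(buf[0] == 0x12 && buf[3] == 0x78);
    /* Reading the same bytes as little-endian yields the reversed value. */
    assert(read_le_u32(buf) == 0x78563412u);
}
```

Neither function inspects the host byte order; each is written purely in terms of the byte order of the data.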

You can do the same thing in Rust, Go, or any other language. No special type definitions or macros necessary.

https://godbolt.org/z/746EaYx4r


whizzter 83 days ago

MacOS "was" big-endian due to the 68k and later PPC CPUs (the PPC Macs could have been little-endian, but Apple picked big-endian for convenience and porting).

Their x86 changeover moved the CPUs to little-endian, and AArch64 continues and solidifies that tradition.

Same with Java: there was probably a strong influence from SPARC, and with PPC, 68k and SPARC all being relevant back in the 90s, it wasn't a bold choice.

But all of this is more or less legacy at this point. I have little reason to believe that the types of code I write will ever end up on an s390 or any other big-endian platform unless something truly revolutionizes the computing landscape, since x86, aarch64, risc-v and so on run little-endian now.
