by mynegation 2353 days ago
Everything is bytes, but the meaning assigned to those bytes matters. Let's say I create a file named «Файл» on Unix in UTF-8 and put it into a git repo. For Unix it is a sequence of bytes that represents Russian letters in UTF-8. So far so good. Now I clone this repo on Windows: what should happen? The file cannot be restored with its name taken as the raw bytes encoded on Unix; that will produce garbage (it even has a special name, "mojibake") in the best case, or fail outright in the worst. What should happen is decoding those bytes from UTF-8 to get the original Unicode code points, then encoding them in the Windows native encoding (UTF-16).
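To make the decode-then-encode step concrete, here is a minimal Python sketch. The helper name `transcode_name` and the choice of cp1251 as the "wrong" legacy code page are assumptions for illustration only; this is not how git or Windows actually implement it.

```python
def transcode_name(raw: bytes, src: str = "utf-8", dst: str = "utf-16-le") -> bytes:
    """Hypothetical helper: decode bytes to code points, then re-encode."""
    text = raw.decode(src)      # bytes -> Unicode code points
    return text.encode(dst)     # code points -> target byte representation

# The file name as stored on the Unix side, in UTF-8.
stored = "Файл".encode("utf-8")

# Correct path: decode from UTF-8, encode to UTF-16 (little-endian, no BOM).
windows_bytes = transcode_name(stored)

# Wrong path: reinterpreting the same bytes under another encoding
# (cp1251 is an assumed example) yields mojibake instead of the name.
mojibake = stored.decode("cp1251")

print(stored)         # the UTF-8 byte sequence
print(windows_bytes)  # the UTF-16-LE byte sequence
print(mojibake)       # garbage characters, not the original name
```

The code points stay the same across both representations; only the bytes change.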
2 comments

True, but one of those representations still needs to be the canonical one in the repo, for the purposes of hashing into the commits and so on.

Git already builds in a bunch of logic like this for handling line endings in text files.
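A rough sketch of the "one canonical form for hashing" idea, using line endings as the example. The function names are hypothetical and SHA-256 over the raw content is just a stand-in digest; this is not git's actual conversion or object-hashing code.

```python
import hashlib

def to_canonical(data: bytes) -> bytes:
    """Normalize CRLF line endings to LF, the form we choose to hash."""
    return data.replace(b"\r\n", b"\n")

def to_working_tree(canonical: bytes, eol: bytes = b"\r\n") -> bytes:
    """Convert canonical LF content to a platform's line ending."""
    return canonical.replace(b"\n", eol)

def content_id(data: bytes) -> str:
    """Hash the canonical form so every platform agrees on the id."""
    return hashlib.sha256(to_canonical(data)).hexdigest()

unix_copy = b"line one\nline two\n"
windows_copy = b"line one\r\nline two\r\n"

# Different bytes on disk, same identity once canonicalized.
same_id = content_id(unix_copy) == content_id(windows_copy)
print(same_id)
```

The same shape would apply to file names: pick one stored encoding, hash that, and convert at the edges.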

Everything isn't bytes. Strings without an encoding don't have a specific byte representation.
It's the other way around. Strings always have meanings and always reference the same characters. You use encoding to encode strings into bytes.

Bytes without an encoding don't have any meaning; they are just... random bytes.

We're actually saying the same thing. You're saying without an encoding you can't turn bytes into a string (technically, in Python terminology, that's a decoding, but you know... ;-). I'm saying a string doesn't have a byte representation without an encoding. That's two perspectives on the same truth.

I absolutely agree that a string has meaning without a byte representation. That's the whole point of having it as a distinct type.
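Both perspectives fit in a few lines of Python. This is only an illustration, with "é" and the three encodings picked arbitrarily:

```python
text = "é"  # one string: a single code point, U+00E9

# A string has no byte representation until you pick an encoding.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")
as_utf16 = text.encode("utf-16-le")

# Bytes have no meaning as text until you pick a decoding.
raw = b"\xc3\xa9"
read_as_utf8 = raw.decode("utf-8")
read_as_latin1 = raw.decode("latin-1")

print(as_utf8, as_latin1, as_utf16)
print(read_as_utf8, read_as_latin1)
```

One string gives three different byte sequences, and one byte sequence gives two different strings.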