| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by mynegation 2353 days ago
	Everything is bytes but the meaning assigned to bytes, matters. Let’s say I create a file named «Файл» on Unix in UTF8 and put it into git repo. For Unix it is a sequence of bytes that is representation of Russian letters in UTF8. So far so good. Now I clone this repo to Windows, what should happen? The file can not be restored with the name as encoded into bytes on Unix, that will be garbage (that even has a special name “Mojibake”) in the best case or fail outright in the worst. What should happen is decoding of those bytes from UTF8 (to get original Unicode code points) Into Unicode code points, then encoding using Windows native encoding (UTF-16).

2 comments

mikepurvis 2353 days ago

True, but one of those representations still needs to be canonical one in the repo for the purposes of hashing into the commits and so on.

Git builds a bunch of logic like this in around handling line endings in text files.

link

cbsmith 2353 days ago

Everything isn't bytes. Strings without an encoding don't have a specific byte representation.

link

takeda 2353 days ago

It's the other way around. Strings always have meanings and always reference the same characters. You use encoding to encode strings into bytes.

Bytes without encoding, don't have any meaning, they are just... random bytes.

link

cbsmith 2351 days ago

We're actually saying the same thing. You're saying without an encoding you can't turn bytes into a string (technically, in Python terminology, that's a decoding, but you know... ;-). I'm saying a string doesn't have a byte representation without an encoding. That's two perspectives on the same truth.

I absolutely agree that a string has meaning without a byte representation. That's the whole point of having it as a distinct type.

link