by svennek 3275 days ago
Really?

I find it so much easier (I am also not an English language programmer), because it is very explicit about what is codepoints (i.e. a string) and what is bytes (i.e. on the disk or the network).

The encode/decode functions are basically the only change, but they give you a reason to explicitly tell it about the encoding you expect.
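
To make that concrete, here is a minimal sketch of the split in Python 3. It assumes UTF-8 is the encoding you expect on the wire; the variable names are only for illustration.

```python
# Text is a sequence of codepoints (str); what goes to disk or the network is bytes.
text = "rødgrød"

# Going from text to bytes: you have to name the encoding.
wire = text.encode("utf-8")

# Going from bytes back to text: again, you name the encoding you expect.
roundtrip = wire.decode("utf-8")

print(type(text).__name__, len(text))  # str, counted in codepoints
print(type(wire).__name__, len(wire))  # bytes, counted in bytes
print(roundtrip == text)
```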

But as you left python, the point is moot :)

3 comments

Python 2 was also explicit: it had unicode and it had str. I never had a problem managing complex character encoding issues in Python 2.
Agreed. I'm in the same position and I'd also love to hear what the pain points were for OP.
You're right that you can get Python 3 to behave correctly if you jump through some hoops. However, wasn't the point of Python 3 to remove the hoopjumping in the first place?

Anyways, the point is that Python 3's design was broken from the start. Anglophone programmers think that internationalisation means "upgrading" from ASCII to Unicode. In reality, though, the rest of the world's programmers got along just fine long before Unicode was invented and will continue to do fine long after Unicode gets replaced by something else.

True internationalisation means mechanisms to deal with the world's text encodings in a neutral and culture-agnostic way. (And "Unicode only everywhere and no exceptions" is definitely not that.)

> You're right that you can get Python 3 to behave correctly if you jump through some hoops. However, wasn't the point of Python 3 to remove the hoopjumping in the first place?

The hoops of encoding and decoding bytes are not optional: manipulating text and manipulating arbitrary bytes are not the same thing, and encoding and decoding is how you translate between the two domains. Python 2 hid this for a subset of the bytes and code was usually broken as a result. Python 3 requires that this split be taken into account in all cases (as do e.g. Java or C#) and is significantly better as a result.
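
A small sketch of what "in all cases" looks like in Python 3 (the names are made up for the example). Python 2 would implicitly decode a str as ASCII when it met a unicode, which worked for pure-ASCII data and blew up on anything else; Python 3 refuses the mix regardless of content.

```python
text = "rødgrød"
raw = text.encode("utf-8")

# Python 3 refuses to guess: mixing str and bytes is a TypeError,
# whatever the content is.
try:
    text + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The translation between the two domains has to be spelled out.
joined = text + raw.decode("utf-8")

print(mixed_ok)
print(joined)
```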

> Anglophone programmers think that internationalisation means "upgrading" from ASCII to Unicode.

Which is a pretty significant upgrade from their previous case of literally not giving a fuck.

> True internationalisation means mechanisms to deal with the world's text encodings in a neutral and culture-agnostic way.

That's completely meaningless word salad.

And just to clarify: it's completely meaningless word salad because

1. encoding and decoding has relatively little to do with internationalisation

2. encoding and decoding is no more culture-specific than the source encoding is. If you're dealing with culture-specific encodings then, short of doing nothing at all with the content (not even displaying it), you can't be culture-agnostic until after you've decoded the text

3. you can't "neutrally" deal with "the world's text encoding" (whatever that's supposed to mean in your mind) because most of them are not compatible with one another since they use the same binary space for completely different text mappings
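
A quick illustration of point 3, using two single-byte encodings as an arbitrary example (latin1 and cp865 are my choice here, any pair of legacy code pages would do): the same bytes decode to different text, and the same bytes are not valid UTF-8 at all.

```python
# The same byte value means different things in different legacy encodings.
raw = b"r\xf8dgr\xf8d"  # "rødgrød" as written by a latin1 system

as_latin1 = raw.decode("latin1")
as_cp865 = raw.decode("cp865")

print(as_latin1)
print(as_cp865)
print(as_latin1 == as_cp865)

# And the same bytes are not valid UTF-8 at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```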

If you see the converting as hoop-jumping then I agree with you.

If you do recognize that there is such a thing called text, that humans understand, and that, for example, the text "hello" has five "letters"... The point that that text might be represented in five different (and equally valid) ways with different amounts of bytes is irrelevant to the "human user"...

Hence the sharp distinction between text and bytes makes sense to most humans.

The Danish word "rødgrød", for example, is seven letters (i.e. every person would say it is 7 letters long); a computer would call it 7, 7, 9 or 16 bytes (in cp865 (Nordic), latin1 ("Western European"), utf8 and utf16 respectively; the 16 for utf16 includes a 2-byte byte order mark, without it it is 14).

Unless you work in the ideal "text" mode, getting the correct length of the text is not that trivial. Equally, the byte pairs 2-3 and 7-8 in utf8 must not be split (as either half has no meaning on its own). Hence (in utf8) a function "give me the next letter" will return 1,2,1,1,1,2,1 bytes in its invocations.
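
The numbers above can be checked with a short Python 3 sketch. The codec names are Python's spellings of the encodings mentioned; note that Python's plain "utf-16" codec prepends a byte order mark.

```python
word = "rødgrød"

# Length in "text" mode: what a person would count.
print(len(word))

# Length in bytes depends entirely on the encoding.
# Python's "utf-16" codec prepends a 2-byte byte order mark.
sizes = {enc: len(word.encode(enc)) for enc in ("cp865", "latin1", "utf-8", "utf-16")}
print(sizes)

# How many bytes each "give me the next letter" step consumes in UTF-8.
steps = [len(ch.encode("utf-8")) for ch in word]
print(steps)

# Cutting between bytes 2 and 3 leaves half a letter, which does not decode.
data = word.encode("utf-8")
try:
    data[:2].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)
```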

Also casefolding (upper/lower case) is hard when working in bytes (some glyphs might not even have one or the other)..
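
A rough sketch of the difference, assuming the bytes are latin1. The German ß is my own example of a letter without a single-character uppercase form, not something from the thread.

```python
word = "rødgrød"

# On text, case conversion knows about ø.
upper_text = word.upper()
print(upper_text)

# On bytes, only ASCII letters are touched; the ø byte is left alone.
raw = word.encode("latin1")
upper_raw = raw.upper()
print(upper_raw)
print(upper_raw.decode("latin1") == upper_text)

# Some letters have no single-character counterpart: ß uppercases to two letters.
sharp_s_upper = "ß".upper()
print(sharp_s_upper, len(sharp_s_upper))
```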

I am unable to say if unicode fits all languages in the world as needed, but it is much better than byte-wrangling if you have multiple possible encodings at once.

EDIT: for reference the bytes of "rødgrød" are

cp865 : b'r\x9bdgr\x9bd'

latin1 : b'r\xf8dgr\xf8d'

utf8 : b'r\xc3\xb8dgr\xc3\xb8d'

utf16le: b'r\x00\xf8\x00d\x00g\x00r\x00\xf8\x00d\x00'

utf16be: b'\x00r\x00\xf8\x00d\x00g\x00r\x00\xf8\x00d'

(and it cannot be written in ASCII ("C"))
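
If you want to reproduce the listing, something like this should do it in Python 3. The dictionary keys are Python's codec names for the encodings listed; "utf-16-le" and "utf-16-be" are the variants without a byte order mark.

```python
word = "rødgrød"

# Python codec names mapped to the bytes listed above.
listed = {
    "cp865": b"r\x9bdgr\x9bd",
    "latin1": b"r\xf8dgr\xf8d",
    "utf-8": b"r\xc3\xb8dgr\xc3\xb8d",
    "utf-16-le": b"r\x00\xf8\x00d\x00g\x00r\x00\xf8\x00d\x00",
    "utf-16-be": b"\x00r\x00\xf8\x00d\x00g\x00r\x00\xf8\x00d",
}

# Check each listed byte string against what the codec actually produces.
matches = {enc: word.encode(enc) == raw for enc, raw in listed.items()}
print(matches)

# ASCII has no ø, so the strict encoder refuses.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```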

Anyone saying that bytes and text are the same is nuts...

The bytes and bytearray types in Python 3 are basically the str type of Python 2 and I don't see that as much of a hoop to jump through. The Unicode default seems like a very sane decision from my point of view. Your argument primarily seems based around an anti-Unicode viewpoint. Am I missing something?