by ubernostrum 3200 days ago
that explanation of UTF-8 is crap. UTF-8 is beautiful quite apart from its utility, but you'd hardly know it from the article

My goal was not to judge UTF-8 aesthetically, but to explain how it works and point out that it's a variable-width encoding which emphasizes its compatibility with ASCII for strings containing only code points <= U+007F.
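A minimal Python sketch of that point, for anyone who wants to check it themselves. The sample characters are chosen here for illustration and are not taken from the article:

```python
# Four sample characters, one for each UTF-8 sequence length (illustrative picks).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each code point to its UTF-8 bytes; the width grows with the code point.
utf8_bytes = {ord(ch): ch.encode("utf-8") for ch in samples}

for code_point, encoded in utf8_bytes.items():
    print(f"U+{code_point:04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# For text containing only code points <= U+007F, UTF-8 and ASCII
# produce identical bytes.
ascii_text = "plain ASCII"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_as_ascii)
```

Only the first sample fits in a single byte; everything above U+007F takes two to four.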

Unicode Consortium et al. are absurdly arrogant.

I would agree that Unicode as it exists today involves some historical and historic bad decisions. But again, I'm staying off value judgments with respect to Unicode itself, since the point of the article was to explain how Python now handles it internally.


Oh, Hi there.

Apologies for being cranky. You did a great job explaining how Python now handles Unicode!

To me it was strange reading about UTF-32 first and then getting to UTF-8 from that context. It seemed to obscure the coolth and beauty of the format.

Overall a great article, sorry again for being so negative.

That section was written for people who know little to nothing about Unicode and the ways Unicode can be encoded to bytes. So it starts with the obvious approach -- just spit out a sequence of bytes whose integer values are the code points, which is near enough as makes no difference to how UTF-32 works -- then introduces variable-width encoding through the history of UCS-2 and UTF-16, then gets to UTF-8 and what motivated it.
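A rough Python sketch of that progression, under some assumptions: `naive_fixed_width` is a hypothetical helper written for this comment, not something from the article, and it picks big-endian byte order with no byte order mark so it can be compared directly against the `utf-32-be` codec.

```python
import struct

def naive_fixed_width(text):
    """The 'obvious approach': pack each code point as a 4-byte big-endian integer."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)

# Illustrative sample: three ASCII characters, the euro sign (U+20AC),
# and one code point outside the Basic Multilingual Plane (U+1F600).
text = "hi \u20ac\U0001F600"

# With a fixed byte order and no BOM, the naive packing matches UTF-32.
matches_utf32 = naive_fixed_width(text) == text.encode("utf-32-be")
print(matches_utf32)

# Encoded size of the same five code points in each encoding.
sizes = {name: len(text.encode(name)) for name in ("utf-32-be", "utf-16-be", "utf-8")}
for name, size in sizes.items():
    print(name, size)
```

UTF-16 needs a surrogate pair (4 bytes) for the last character, which is where the variable-width part comes in, and UTF-8 spends 1 to 4 bytes per code point.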

The advantages and disadvantages of the various encodings are something that could eat up several pieces just as long as the entire post, and for fun I'd probably throw in weird stuff like the attempt to do an EBCDIC-compatible UTF instead of an ASCII-compatible one, etc.

Someone should write up EBCDIC-based UTF as an RFC. I'm sure that there's at least one COBOL programmer out there that has been waiting for that for decades.

ETA: Mostly a joke, but it would also fit right in with things like WTF-8 (https://simonsapin.github.io/wtf-8/)

It wasn't a joke. UTF-EBCDIC is a Unicode Technical Report:

http://www.unicode.org/reports/tr16/

aw, now i'm cranky again. lol