by tedmiston 4055 days ago
> Then all this u"xx" stuff on strings. What's that about? I don't care about unicode. 256 ascii characters is fine for me. If I need Unicode I can do it, but I don't need it by default.

I felt this way before I started spending all of my time on web apps. It's reading user input from some random public source, like Twitter, that forces it upon you. Then, quite quickly, it became best practice to "unicode all the things". I think of it as analogous to how we always store timestamps in UTC.

One difference, though, is that time enjoys a certain natural and intrinsic consensus. For example, we all agree that observable time always flows forward at the same rate.

OTOH: Which characters do and don't belong in unicode and in what order? I don't fucking know. :-)

> OTOH: Which characters do and don't belong in unicode and in what order? I don't fucking know. :-)

Should we use decimalized time or time based on the Babylonian base 60/12 system? Both have clear advantages. I don't fucking know. :-)

The world has standardized on Unicode, which (as a collection of expanding standards) defines the set of valid code points and their order. There's still some debate as to UTF-8 vs. UTF-16LE (and perhaps UTF-16 w/BOM and UTF-32) encodings, but Unicode has clearly won. It's not perfect, but it's silly to pretend Unicode hasn't won.
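
To make the code point vs. encoding distinction concrete, here is a minimal Python 3 sketch. The sample string and variable names are my own picks for illustration. The code points stay the same in every case; only the byte serialization changes.

```python
import codecs

# Sample text written with escapes so the source file's own encoding doesn't matter:
# "naïve 日本語" (9 code points).
text = "na\u00efve \u65e5\u672c\u8a9e"

# Unicode assigns each character a number (code point), independent of any encoding.
code_points = [ord(ch) for ch in text]

# The same code points serialized with different encodings.
utf8 = text.encode("utf-8")          # 1 to 4 bytes per code point
utf16le = text.encode("utf-16-le")   # 2 or 4 bytes per code point, no BOM
utf16bom = text.encode("utf-16")     # BOM first, then the platform's native byte order
utf32le = text.encode("utf-32-le")   # always 4 bytes per code point

for name, data in [("utf-8", utf8), ("utf-16-le", utf16le),
                   ("utf-16", utf16bom), ("utf-32-le", utf32le)]:
    print(f"{name:10} {len(data):3} bytes  {data.hex(' ')}")

# Every encoding round-trips to the identical string.
assert utf8.decode("utf-8") == utf16le.decode("utf-16-le") == text
```

The byte counts differ (UTF-8 is the shortest here because most of the sample is ASCII), but decoding any of them gives back the same sequence of code points.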

Source: I used to work as an engineer on the content converter portion of Google's indexing system, which took the world's web pages, PDFs, etc. and converted them into a unified format (the text portion of which is encoded as UTF-8) for the rest of the indexing system. Sure, we saw some percentage of EUC-KR, GB2312, Big5, and Windows-1252 (CP1252) text, but Unicode has clearly won and UTF-8 and UTF-16LE are steadily replacing all other encodings.
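
For a rough idea of what "convert everything to UTF-8" looks like at the smallest scale, here is a Python 3 sketch. To be clear, this is not the actual converter: `to_utf8` is a hypothetical helper, and it assumes the source encoding is already known and correct, which in practice it often isn't (real pipelines also have to detect or guess the encoding).

```python
def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    Hypothetical helper for illustration. Undecodable bytes become U+FFFD
    instead of raising, so one bad byte doesn't sink the whole document.
    """
    text = raw.decode(declared_encoding, errors="replace")
    return text.encode("utf-8")


# Short sample strings, written with escapes, for each legacy encoding named above.
originals = {
    "euc_kr": "\ud55c\uae00",   # Korean: "hangul"
    "gb2312": "\u4e2d\u6587",   # Simplified Chinese: "Chinese language"
    "big5":   "\u4e2d\u6587",   # same characters, Traditional Chinese encoding
    "cp1252": "caf\u00e9",      # Western European
}

for encoding, original in originals.items():
    legacy_bytes = original.encode(encoding)      # what arrives off the wire
    unified = to_utf8(legacy_bytes, encoding)     # what the rest of the system sees
    assert unified.decode("utf-8") == original
    print(f"{encoding:7} {legacy_bytes.hex(' '):12} -> utf-8 {unified.hex(' ')}")
```

Note that the GB2312 and Big5 inputs are different bytes for the same two characters, and both come out as identical UTF-8. That is the whole point of normalizing at the boundary.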