by stefco_ 2403 days ago
> the broken programs would still be broken in either language.

In Python 3 you need to slap a decode on reads from subprocesses anyway, and files open in text (Unicode) mode by default. Wouldn't that fix the majority of silly UTF-8 compat bugs? Or am I missing a class of bugs that's not avoided automatically by Python 3 strings?

1 comments

Well, the summary of the argument is that Python 3's Unicode strings do not actually solve the fundamental problem that multiple encodings exist. Think: Do you know that the process actually returns UTF-8, or that the file is actually encoded in UTF-8? No, you're just guessing. This puts people in the habit of trying to decode everything as UTF-8, and if that is all the code does, it could happen automatically without so much boilerplate.
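To make the "just guessing" point concrete, here is a minimal sketch. The variable names are made up for illustration, and the bytes are built in place as a stand-in for whatever a process or file would hand you. The same bytes decode without any error under two different encodings, so a decode that succeeds does not show that the guess was right.

```python
# Stand-in for bytes read from a subprocess pipe or a file opened in binary mode.
raw = "naïve".encode("utf-8")

# Guess 1: UTF-8. This happens to be the encoding the bytes were written in.
as_utf8 = raw.decode("utf-8")

# Guess 2: Latin-1. This never raises, because every byte value maps to a
# code point, but the result is mojibake rather than the original text.
as_latin1 = raw.decode("latin-1")

print(repr(as_utf8))
print(repr(as_latin1))
```

Both calls return a `str`; only knowledge from outside the data (docs, headers, a spec) says which one is correct.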

On the other end, most programs don't actually care what the data encoding is. They just move it.

> Think: Do you know that the process actually returns UTF-8, or that the file is actually encoded in UTF-8? No, you're just guessing.

Well, no, not really. You go read the docs and try to find out. Most of the time, there is a definitive encoding - if there weren't, a lot more things would be broken. Sometimes, it is not guaranteed, even though de facto that is the case - and this highlights broken interface specifications. When it is truly unknown, you explicitly treat it as raw bytes.
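A rough sketch of that policy, under assumptions: `read_boundary` and `documented_encoding` are hypothetical names, not anything from a library. Decode strictly when the interface documents an encoding, and pass bytes through untouched when it does not.

```python
from typing import Optional, Union


def read_boundary(raw: bytes, documented_encoding: Optional[str]) -> Union[str, bytes]:
    """Decode only when the interface documents an encoding (hypothetical helper)."""
    if documented_encoding is None:
        # Encoding truly unknown: keep the data as raw bytes.
        return raw
    # Strict decoding: raises UnicodeDecodeError if the data contradicts the docs.
    return raw.decode(documented_encoding)


# Stand-in for data arriving from a subprocess or a file.
payload = "naïve".encode("utf-8")

known = read_boundary(payload, "utf-8")   # str
unknown = read_boundary(payload, None)    # bytes, unchanged
```

With strict decoding, a wrong spec fails at the boundary instead of somewhere downstream.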

And the good thing about Python 3 is that it forces you to think about this. In Python 2, most of the time, data processing code can be hacked together, and it "just works", right until the point the input happens to include something unanticipated. Like, say, the word "naïve".

> On the other end, most programs don't actually care what the data encoding is. They just move it.

It doesn't necessarily mean that they get to dodge the bullet. In Python 2, if you read data from a file, you get raw bytes, but if you read data from parsed JSON, you get Unicode strings - because JSON itself is guaranteed to be Unicode. Guess what happens when the byte string you've read from the file, and the Unicode string you've read from a JSON HTTP response, are concatenated?
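For anyone who wants to see the answer without a Python 2 interpreter, here is a sketch in Python 3. `py2_style_concat` is a made-up helper that emulates Python 2's implicit coercion, where the byte string is decoded with the default ASCII codec before concatenation. The JSON document and byte strings are invented sample data.

```python
import json


def py2_style_concat(text: str, raw: bytes) -> str:
    """Emulate Python 2's implicit str -> unicode coercion (ASCII decode)."""
    return text + raw.decode("ascii")


# Parsed JSON yields a Unicode string.
from_json = json.loads('{"greeting": "hello "}')["greeting"]

# Stand-ins for byte strings read from a file.
ascii_bytes = b"world"
utf8_bytes = "naïve".encode("utf-8")

# Pure-ASCII input: the Python 2 behaviour "just works".
works = py2_style_concat(from_json, ascii_bytes)

# Non-ASCII input: the same code path now fails at runtime.
try:
    py2_style_concat(from_json, utf8_bytes)
    py2_outcome = "ok"
except UnicodeDecodeError:
    py2_outcome = "UnicodeDecodeError"

# Python 3 refuses to mix the types at all, even for pure-ASCII bytes.
try:
    from_json + ascii_bytes
    py3_outcome = "ok"
except TypeError:
    py3_outcome = "TypeError"

print(works, py2_outcome, py3_outcome)
```

So in Python 2 the failure depends on the input data, while in Python 3 it shows up the first time the line runs.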