by ubernostrum 2735 days ago

The way I usually put it is that Python 3 shifted its priorities.

Previously, if you were a UNIX-y scripter writing UNIX-y scripts on your UNIX-y OS, the fact that Python just kind of pretended encoding issues would never exist was a help to you. It adopted the same "everything is ASCII, or at most UTF-8 in the ASCII range, and if it isn't I'll break in cryptic ways" approach as most other UNIX-y scripting things.

If you were doing anything other than UNIX-y scripting on your UNIX-y OS, this easily became a huge nightmare in Python 2. Django went through a massive rewrite early in its history precisely because of this, to ensure that encoding/decoding happened at the boundaries and everything you'd work with inside a Django app was already a Unicode string. And I remember what it was like trying to work on the web before that approach, and what the work to fix it was like.

Python 3 decided to make the UNIX-y scripters actually learn what a horrid mess UNIX is with respect to locales and encodings and filesystem paths, in order to free the rest of the Python community from the nightmares inflicted by prioritizing the UNIX-y scripters to the exclusion of everyone else. So yes, you have to do more work. Yes, you have to learn that a file path is actually an opaque bag of bytes that may not be in any actual encoding and thus can never properly decode to a string. Yes, you have to learn to use fsencode() and the surrogateescape handler in order not to blow up your scripts.
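As a rough illustration of the fsencode() point, here is a minimal sketch of emitting a file name without ever encoding it strictly. The helper name `write_name` is made up for this example, and the sketch assumes a POSIX system, where `os.fsdecode()` and `os.fsencode()` use the surrogateescape handler.

```python
import os
import sys


def write_name(name: str, stream=None) -> None:
    """Write a file name to a binary stream as the original OS-level bytes.

    `write_name` is a hypothetical helper, not a standard library function.
    """
    if stream is None:
        # The binary layer under stdout; no text encoding is applied here.
        stream = sys.stdout.buffer
    # os.fsencode() reverses the escaping done when the name was decoded,
    # so undecodable bytes come back out exactly as the filesystem stored them.
    stream.write(os.fsencode(name) + b"\n")
```

Calling `print(name)` on the same value could raise UnicodeEncodeError when the name holds escaped bytes, which is the kind of blow-up the comment refers to.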

But I'm OK with that, because it puts the workload on you when you're using such a system, rather than magically trying to fix it for you at the cost of everyone else's sanity. It also means that you have to learn to write those scripts correctly. Which is more work than what Python 2 required, but not the world-ending apocalyptic horror it's usually presented as (and is, again, mostly the fault of UNIX-y systems doing their old UNIX-y things, not the fault of Python).


slavik81 2735 days ago

So, write the program correctly. Show me.

> Python 3 decided to make the UNIX-y scripters actually learn what a horrid mess UNIX is with respect to locales and encodings and filesystem paths

Show me how much Python 3 improves this. To expand on the program before, make it a directory named 'input' full of files, and a directory named 'output' to put the processed files in. Print each file name as the corresponding file is processed to indicate progress.
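For concreteness, one possible shape of the exercise is sketched below. It is only a sketch: `process` is a placeholder for whatever the original program did to each file, the names `display` and `process_tree` are invented here, and it assumes stdout can encode any name that decodes cleanly as UTF-8. It sidesteps decoding by passing bytes paths to the os functions, which then return bytes names.

```python
import os


def display(name: bytes) -> str:
    """Return a printable form of a raw file name."""
    # Undecodable bytes are rendered as \xNN escapes instead of raising.
    return name.decode("utf-8", errors="backslashreplace")


def process(data: bytes) -> bytes:
    """Hypothetical placeholder for the real per-file transformation."""
    return data


def process_tree(src: bytes = b"input", dst: bytes = b"output", report=print) -> None:
    """Process every regular file in `src` into `dst`, reporting each name."""
    os.makedirs(dst, exist_ok=True)
    # A bytes argument makes os.listdir() return bytes names, untouched.
    for name in sorted(os.listdir(src)):
        src_path = os.path.join(src, name)
        if not os.path.isfile(src_path):
            continue
        report(display(name))
        with open(src_path, "rb") as f:
            data = f.read()
        with open(os.path.join(dst, name), "wb") as f:
            f.write(process(data))
```

The output file keeps the exact original name because the name is never converted to str on the way through; only the progress message is decoded.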

I would be applauding Python if it did make the difficulties with this exercise obvious, but it absolutely does not. The file system APIs return strings, but the strings they return may not be valid Unicode. PEP 383 turned the Python 3 str type into a bag of bytes.

Python tries to sweep encodings under the rug. It makes the encoding a default value all over the place and hides conversions everywhere.

I 100% agree that developers need to think about encodings and handle them in their programs. That's exactly why I hate string handling in Python 3: because rather than making you handle the corner cases, it pretends they don't exist, until they're found one by one by your users.

Python 3 encourages developers to write broken string handling code.


ubernostrum 2735 days ago

> So, write the program correctly. Show me.

You just want to fight someone because you're angry, and I don't do that. No matter what someone writes, you'll find a way to argue that it's wrong and then prance around declaring "victory".

(I'd also bet that you probably couldn't do it if I were the one who got to set the evaluation criteria, and you also couldn't pass other "challenges" like writing proper HTTP handling -- the person who gets to grade the challenge always "wins", which is why you want to be the person who grades the challenge.)

> PEP 383 turned the Python 3 str type into a bag of bytes.

PEP 383 provided a way to read certain things -- primarily filesystem paths which can be basically anything -- using an escape mechanism to replace non-decodable bytes with surrogates when decoding to string, which in turn allow losslessly transforming back to the original bag of bytes.

Which is necessary, because there are real filesystems out there that really have paths and names that can never validly decode from any known text encoding. It doesn't turn strings into "bags of bytes"; the resulting str still is an iterable of actual valid Unicode code points.
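The escape mechanism described above can be seen directly at the codec level. This is a small sketch using an explicit UTF-8 codec rather than the filesystem encoding, so the variable names and the sample bytes are only illustrative.

```python
raw = b"caf\xe9.txt"  # 0xE9 is Latin-1 for an accented e; not valid UTF-8 here

# Decoding with surrogateescape maps the bad byte 0xE9 to the lone
# surrogate U+DCE9 instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original byte exactly.
restored = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate, which is where scripts
# that print or log such names tend to fail.
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

Here `text` is `"caf\udce9.txt"`, `restored` equals `raw`, and `strict_encode_ok` ends up `False`.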

> Python tries to sweep encodings under the rug.

As the saying goes, you can't reason someone out of a position they didn't reason their way into, so I won't try here.

> Python 3 encourages developers to write broken string handling code.

Python 3 no longer tries to cover for the random gibberish that's legal to put in filesystem paths, and makes the developer handle it. Are there lots of developers out there who don't realize that filesystem paths can legally contain undecodable garbage? Sure. That's not Python's problem to solve, though; it gives you the surrogateescape handler, and the fsencode helper, and keeps working on things like PEP 538 and PEP 540 to try to give you tools to work around it. But Python can't magically fix the mess that is UNIX locales and bag-of-bytes paths (nothing can, short of burning UNIX down and starting over), and doesn't try to do it for you.


slavik81 2735 days ago

You're right that I'm frustrated. I'm not even upset at anyone in this thread; it comes from previous discussions. I apologise for carrying that baggage here. Having my morality questioned when I legitimately want to make programs work correctly for foreign languages has left me... emotional on this subject.

Unix has its own problems, but the program avian wrote works correctly on Linux. Most encoding issues I encountered when working with Python 3 were on Windows.

PEP 383 was making the best of a bad situation without breaking the API again. The real mistake was having the functions return strings in 3.0. The operating system APIs should have returned path objects that require an explicit conversion to string with an explicit error handling mechanism.
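To make that design idea concrete, here is a rough sketch of what such a path object might look like. `RawPath` and `list_raw` are hypothetical names invented for this illustration; nothing like them is claimed to exist in the standard library.

```python
import os
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RawPath:
    """Hypothetical wrapper that holds OS-level bytes and never decodes implicitly."""

    raw: bytes

    def to_str(self, encoding: str, errors: str) -> str:
        # Both arguments are required, so the caller has to choose an
        # encoding and an error policy at every conversion site.
        return self.raw.decode(encoding, errors)

    def __fspath__(self) -> bytes:
        # Lets open() and the os functions accept the object unchanged.
        return self.raw


def list_raw(directory: bytes) -> List[RawPath]:
    """Hypothetical listing helper that returns RawPath objects, not str."""
    return [RawPath(name) for name in os.listdir(directory)]
```

With this shape, `RawPath(b"caf\xe9.txt").to_str("utf-8", "strict")` raises UnicodeDecodeError at the point of conversion, while passing `"replace"` or `"surrogateescape"` is a visible decision in the calling code.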

Python gives you all the tools you need to do this right, but they're easy to unknowingly use in ways that break on corner cases. A well-defined API should guide you towards the correct solution and should make pitfalls obvious.

In any case, I should probably give it a rest. I work hard to make sure my programs do this stuff right, and I suppose that's all I can really do.
