|
|
|
|
|
by mjevans
2271 days ago
|
|
The Unicode issue is one of the most painful sticking points. Handle legacy filesystems with the chance of filesystems that have previously valid (non / non \0 containing filenames) files in various encoding soup nightmares? Python3 is NOT the the tool for that job! Which means you can't write any systems tool stuff in Python3 because there's a good chance it'll blow up unexpectedly, or that you suddenly have to handle all sorts of things that in any other language you can just GIGO and move on. |
|
It would be unwise to design an API that promotes a niche need like dealing with a legacy file system with corrupted file names. This is what Python 3 fixed, making the easy things easy, and the complicated things possible, not the other way around.
But Python 3 is absolutely up to the task of handling legacy file systems with random encoding mixed in, you just need to tell it explicitly you are doing so.
Let's create a file with a completely garbage name, made of random bytes, which is allowed on Unix:
If you pass bytes to any file system function, it will return file names as bytes: And you can just open that: You do exactly the same as what you did with Python 2, and treat the files as raw bytes entirely, without thinking about the content.Some API in Python require text. If you want to pass the file names to those API, you can use surrogate escape, which let you convert back and forth between arbitrary bytes and utf8 text, without loosing information:
This is a good thing, it forces the dev to be explicit about the places in your code where you are dealing with a specific scenario. It also makes you pay the price of doing so opt in, not opt out.If you have to do a robust version of this with Python 2, you will have to do that anyway: at some point mixed encoding will bite you if you don't have a neutral representation for them. Python 2 gave you the illusion of robustness, because it said "yes" to most operations.
I remember quite well that a lot of Python 2 programs didn't work in Europe because your user directory would contain your name, which could be non ascii. Python 2 programs are opt in to deal with it. It's the opposite philosophy, and caused so many crashes.