Hacker News new | ask | show | jobs
by mjevans 2271 days ago
The Unicode issue is one of the most painful sticking points. Handle legacy filesystems with the chance of filesystems that have previously valid (non / non \0 containing filenames) files in various encoding soup nightmares? Python3 is NOT the the tool for that job! Which means you can't write any systems tool stuff in Python3 because there's a good chance it'll blow up unexpectedly, or that you suddenly have to handle all sorts of things that in any other language you can just GIGO and move on.
1 comments

Python 3 promotes the most likely scenario: you are on a modern system with normal looking filenames.

It would be unwise to design an API that promotes a niche need like dealing with a legacy file system with corrupted file names. This is what Python 3 fixed, making the easy things easy, and the complicated things possible, not the other way around.

But Python 3 is absolutely up to the task of handling legacy file systems with random encoding mixed in, you just need to tell it explicitly you are doing so.

Let's create a file with a completely garbage name, made of random bytes, which is allowed on Unix:

    >>> import os, sys
    >>> sys.version_info          
    sys.version_info(major=3, minor=7, micro=5, releaselevel='final', serial=0)
    >>> with open(os.urandom(32), 'wb') as f: 
    ...     f.write(os.urandom(200)) 
If you pass bytes to any file system function, it will return file names as bytes:

    >>> filename = os.listdir(b'.')[0]
    >>> type(filename)
    <class 'bytes'>
    >>> filename[:10]   
    b'\xf8-U\xa5\x1dq\xad?\xbf\xa2'
And you can just open that:

    >>> data = open(filename, 'rb').read() 
    >>> data[:10]
    b'E\x05\xce*M \xf5\xfeK\x18'
    >>> type(data)
    <class 'bytes'>
You do exactly the same as what you did with Python 2, and treat the files as raw bytes entirely, without thinking about the content.

Some API in Python require text. If you want to pass the file names to those API, you can use surrogate escape, which let you convert back and forth between arbitrary bytes and utf8 text, without loosing information:

    >>> as_text = filename.decode('utf8', errors='surrogateescape')
    >>> as_text[:10]
    '\udcf8-U\udca5\x1dq\udcad?\udcbf\udca2'
    >>> type(as_text)
    <class 'str'>
    >>> as_text.encode('utf8',  errors='surrogateescape') == filename
    True
This is a good thing, it forces the dev to be explicit about the places in your code where you are dealing with a specific scenario. It also makes you pay the price of doing so opt in, not opt out.

If you have to do a robust version of this with Python 2, you will have to do that anyway: at some point mixed encoding will bite you if you don't have a neutral representation for them. Python 2 gave you the illusion of robustness, because it said "yes" to most operations.

I remember quite well that a lot of Python 2 programs didn't work in Europe because your user directory would contain your name, which could be non ascii. Python 2 programs are opt in to deal with it. It's the opposite philosophy, and caused so many crashes.

I failed to mention that it's only useful to do this manually if you need to decode path with mixed encodings, or share said path with the rest of the world.

If all you need is to work with arbitrary mixed bags of file names, you can just pass "str", and Python will use automatically and transparently surrogateescape everywhere.

I think that the main complaint is that the early versions of python 3 had deep issues with this and it took years to fix them.
The first versions of python 3 were indeed, not suitable for serious work. Python 3 started to be usable from 3.3, interesting from 3.4, comfortable from 3.5, and objectively way better than 2.7 from 3.6.

It's not a surprise, as most softwares need iterations to get get good. Python 2.7 has not started as the amazing tool it is now, and as I started my career with 2.4, you remember some funny stuff.

This is why Python 2.7 was kept around for 13 years after Python 3 first came out.

Now, 3.6 came out in 2016. It solves many issues 2.7 had, and add tons of goodies. It's very ergonomic, can be installed easily. It's a great software.

It's 2020, let's enjoy the goodness of Python 3.

But this is exactly why people keep piling on about python 3, as an "upgrade" it was worse than what tried to replace for 8 years.
Sure.

And now it has been better for the last 5 years, and does solve the problems it intended to solve.

No one disputes that today's Python 3 is better that today's Python 2, especially for new projects. (A point can be made that Python 2 now is perfectly frozen in time and so 100% stable, but most people do not care that much)

The only thing this means is that the botched upgrade did not end up killing Python; it says nothing about whether it was done badly or not.

(I have nothing against python, I just believe it is important to understand why and how what happened happened to avoid similar errors in the future)