by excessive 2426 days ago
> There were some intractably difficult, "rip-the-band-aid-off" types of changes that had to happen at some point.

I suspect you're referring to Unicode here. In that case, I think they could've just added a flag to Python 2's str type to indicate "is UTF-8" and deprecated the old unicode object. Then add some functions to extract code points or grapheme clusters or whatever else you need from the old school str object.

I might be in the minority, but I really like that Python 2's str could hold arbitrary binary data, of which UTF-8 is just one possibility. It had good interop with C, which I think is fundamental to a glue language like Python. I'd rather have fewer string types instead of more (one of my complaints about Rust too).

If you meant the print function, there were other ways to solve that too. The simplest might be to create a new name for the function and deprecate the print statement. So old code uses the statement "print 123" while new code is encouraged to call the function "echo(123)" or "output(123)". Bikeshed the actual name...

Note when I say "deprecate", I mean provide a timeline over several releases where it continues to work. Then issue deprecation warnings which can be silenced.
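To make the deprecation idea concrete, here is a minimal sketch written in present-day Python. The names echo and old_print are hypothetical (old_print stands in for the legacy print statement, which can't be expressed as a function in Python 3); only the warnings module calls are real standard-library API.

```python
import warnings


def echo(*args):
    """Hypothetical new spelling that new code would be encouraged to use."""
    print(*args)


def old_print(*args):
    """Hypothetical stand-in for the legacy spelling: still works, but warns."""
    warnings.warn(
        "old_print is deprecated; use echo() instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this wrapper
    )
    echo(*args)


# Code that isn't ready to migrate can silence the warning explicitly.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    old_print(123)
```

The point is only that the old spelling keeps working for several releases while the warning nudges people toward the new one.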

All of the newer features in Python 3 (@ operator, async/await, type annotations, etc...) could've been added in a mostly backwards compatible way. (Note: adding async wasn't really backwards compatible even in Python 3).

Anyways, hindsight is 20/20, but I really do think the path for Python 3 was a poor choice in comparison to other options.

2 comments

Your proposal only works well for US ASCII users. What if I want to manage multiple ISO-8859 encodings in conjunction with 7-bit ASCII? Maybe I also have some EUC-JP multi-byte text to deal with. It becomes an intractable mess without explicit encoding management. Someone will absolutely end up misinterpreting encoded text as bytes and cause all manner of compatibility and security issues. Having a Unicode string type forces this to be dealt with even if it is inconvenient when taking in data from outside the Python environment.
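A small illustration of the ambiguity, as a sketch only: the same bytes decode to different text depending on which ISO-8859 variant you assume. The helper name decode_with_each is made up for this example; the codec names are standard Python codecs.

```python
def decode_with_each(raw: bytes, codecs):
    """Decode the same bytes under several assumed encodings (hypothetical helper)."""
    return {name: raw.decode(name) for name in codecs}


raw = b"caf\xe9"  # one byte sequence, no encoding label attached
results = decode_with_each(raw, ["iso-8859-1", "iso-8859-5", "iso-8859-7"])

for name, text in results.items():
    # Each codec maps byte 0xE9 to a different character.
    print(name, ascii(text))
```

Nothing in the bytes themselves says which reading is the intended one.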
> Your proposal only works well for US ASCII users.

No, and I explicitly mentioned UTF-8. My suggestion is that str holds arbitrary immutable binary data and that you have a method which can interrogate whether that binary data is valid UTF-8.
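Roughly what I have in mind, sketched with today's bytes type standing in for the old str. The function name is_valid_utf8 is hypothetical; it just leans on the strict UTF-8 decoder that already exists.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("h\u00e9llo".encode("utf-8")))  # well-formed text
print(is_valid_utf8(b"\xff\xfe"))                   # bytes that can never appear in UTF-8
print(is_valid_utf8(b"\xc0\xaf"))                   # overlong encoding of "/"
```

The data stays arbitrary binary either way; the check only tells you whether it is safe to treat it as text.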

Yes, real world text is messy and there are lots of encodings, compression schemes, and exceptions (UTF-8 with byte order marks, overlong encodings, or surrogate pairs, as examples). If your main task is converting text between outdated or broken encodings, I don't have any problem saying you need a separate library and shouldn't burden the rest of the user base. Despite its flaws, the majority of the world has settled on Unicode with a UTF-8 encoding.

"Special cases aren't special enough to break the rules."

> I really like that Python 2's str could hold arbitrary binary data

Python 3's str can also hold arbitrary binary data. That ability was introduced in PEP 383.

Thank you for telling me about this - I didn't know they did that...

My first thought as I started reading the PEP was, "Why did they bother adding the 'bytes' type if 'str' is just going to be able to hold everything anyways?"

After looking at more of it though, it seems like they're storing the binary octets as code points in one of several internal Unicode representations. Moreover, they're abusing (reusing?) the range of code points reserved for 16-bit surrogate pairs, but only using the low half of the pair. This is all clever in the bad way.
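For anyone else who hadn't seen it, a quick sketch of the behavior as I understand it from the PEP, using the "surrogateescape" error handler. The sample bytes are arbitrary.

```python
raw = b"abc\xff\x80"  # not valid UTF-8

# Each undecodable byte becomes a lone low surrogate: U+DC00 + byte value.
text = raw.decode("utf-8", errors="surrogateescape")
print([hex(ord(ch)) for ch in text])

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw)

# Without the handler, the str can't be encoded as UTF-8 at all.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

So the str round-trips the bytes, but it isn't something you can hand to a strict encoder.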

This seems like a real lack of taste to me, and I doubt the Guido from 1991 would've found it acceptable to have 'str', 'bytes', and 'bytearray' the way they are. (Let's ignore that 'buffer' became 'memoryview' for now...) It used to be a simple and elegant language.