by kmike84 4776 days ago
Great writeup (and a cool metaclass workaround!)

However, I think that "Drop 2.5, 3.1 and 3.2" advice is bad - dropping 2.5 and 3.1 is the way to go (hey, drop 2.5 even if you're not porting to 3.x), but dropping 3.2 is not necessary in most cases.

In my experience (porting and maintaining 20 open-source packages that work with Python 2 and Python 3 using a single codebase, including NLTK) Python 3.2 has never been a problem - I don't see how NLTK code and code of my other packages could be improved by dropping Python 3.2 compatibility.

The main argument for dropping Python 3.2 support seems to be that u'strings' are not supported in Python 3.2. There are 3 "types" of strings in Python:

* b"bytes",

* "native strings" # bytes in 2.x and unicode in 3.x

* u"unicode"

By adding a `from __future__ import unicode_literals` line to the top of the file, code compatible with 2.6-3.2 could be written like this:

* b"bytes"

* str("native string") # bytes in 2.x and unicode in 3.x

* "unicode"
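A minimal sketch of the three literal styles listed above, assuming a single file that has to run on Python 2.6+ and 3.x. The names `PY2`, `text_type`, `raw`, `native` and `text` are made up for illustration and are not taken from NLTK or any other package:

```python
from __future__ import unicode_literals  # must be the first statement in the module

import sys

PY2 = sys.version_info[0] == 2
# `unicode` only exists on 2.x; the conditional keeps 3.x from ever evaluating it.
text_type = unicode if PY2 else str  # noqa: F821

raw = b"bytes"                 # bytes on 2.6+ and on 3.x
native = str("native string")  # bytes on 2.x, unicode on 3.x
text = "unicode"               # unicode on both, because of the future import

assert isinstance(raw, bytes)
assert type(native) is str
assert isinstance(text, text_type)
```

On 3.x the future import is a no-op, so the same file behaves the same way there with or without it.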

In my opinion this is not a hack (unlike six.b and six.u necessary for 2.5 support), and this is arguably closer to Python 3.x semantics (unicode strings are default). So IMHO while using the u"unicode" feature from Python 3.3 makes porting somewhat easier (less stupid search-replace), it could also make code worse and more cluttered, and Python 3.2-compatible syntax is just fine.

It is true that 3.3 brings other improvements (Armin mentioned binary codecs), but it is quite rare that the library actually needs them (even libraries as big as NLTK and Django are fine with 3.2 stdlib).

3.2 is the default 3.x Python in the current Ubuntu LTS (EOL in 2017) and the default 3.x Python in the recently released Debian Wheezy; 3.2 will be around for a long time, and not supporting it will hurt. So if you're doing Python 3.x porting, please just fix those stupid u'strings' with the unicode_literals future import - your code will be more idiomatic and also 3.2 compatible.

There is also advice in the article to encode __repr__ and __str__ results to utf8 under Python 2.x; this is fine (other approaches are not better), but it has some non-obvious consequences (like breaking the REPL in some setups) that developers should be aware of, see http://kmike.ru/python-with-strings-attached/
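One way the "encode `__str__` to utf8 under Python 2.x" approach could look, as a rough sketch rather than the article's actual code. The decorator name `utf8_str_on_py2` and the `Word` class are hypothetical; the idea is similar in spirit to Django's `python_2_unicode_compatible`:

```python
import sys

PY2 = sys.version_info[0] == 2


def utf8_str_on_py2(cls):
    """Hypothetical class decorator: write __str__ once, returning text.

    On 2.x the text-returning method becomes __unicode__, and __str__
    is replaced by a version that returns UTF-8 encoded bytes.
    On 3.x the class is left untouched.
    """
    if PY2:
        cls.__unicode__ = cls.__str__
        cls.__str__ = lambda self: self.__unicode__().encode("utf-8")
    return cls


@utf8_str_on_py2
class Word(object):
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


# Decoding a bytes literal avoids u"" literals, so this line also runs on 3.2.
word = Word(b"\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82".decode("utf-8"))
```

On 2.x, `str(word)` then returns UTF-8 bytes regardless of the terminal encoding, which is where the REPL problems mentioned above can come from.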

For lower-level 2.x-3.x compatible C/C++ extensions Cython is great. In fact, many libraries (e.g. lxml) are compatible with Python 3.x because they are written in Cython which generates compatible code (modulo library changes) by default.

2 comments

> Python 3.2 has never been a problem

It's not a problem if you are willing to litter your code with calls or upgrade a ton of code in 2.x to unicode accidentally. There are just too many cases in 2.x where that is a terrible idea and introduces subtle bugs. I very strongly recommend against `from __future__ import unicode_literals`. If anything go with six.

In regards to supporting 3.2: I don't think anyone cares. The number of people currently using Python 3 is pretty low and a lot of libraries are already dropping 3.2 support. Requests, MarkupSafe, Jinja2 now all dropped 3.2 support and with that a lot of stuff that pulls in dependencies to those will now also depend on 3.3.

I still think people should stick to 2.7 for at least another year or two, and at that point a lot will have changed.

//EDIT: wrt __str__ returning utf-8 data: __str__'s encoding is undefined but usually accepted to be > ASCII. Django and Jinja2 for instance returned utf-8 there for years and it did not cause any problems.

In case of NLTK unicode_literals ("unicode by default") fixed a lot of bugs and made other bugs visible, so mileage may vary :)

Could you give an example of cases where unicode_literals is a terrible idea?

3.2 is important for newcomer experience IMHO; it is very common for people starting with Python to use 3.x version and wonder why the code doesn't work. It's a pity high-profile packages are dropping 3.2 support, I wasn't aware Requests and Jinja2 dropped it.

utf8 __str__ definitely caused issues for Django (e.g. `print mymodel` sometimes fails in REPL in Windows with Russian locale); people using REPL in Windows are too used to such errors so they don't complain and blame Windows for this, but that doesn't mean there is no issue.

So will latin1 `__str__` on Russian locales. So will ASCII `__str__` on any locale that is not ASCII compatible. You can't expect the impossible.

As for cases where unicode_literals is a terrible idea: any piece of code that suddenly gets a unicode string it does not expect. Because unicode coercion in 2.x spreads like a cancer, you might not see the failure until someone uses your API. I still have to fix bugs where people accidentally send things coerced to unicode to an API that does not support it.

Additionally: newcomers still should not be using Python 3. There are just too many remaining issues that are annoying to deal with.

Are there ASCII-incompatible encodings that are the default in any OS? With an ASCII-incompatible system/terminal encoding a lot of software will stop working. Strange things happen, but this looks like a theoretical issue, and ASCII looks safe. In Python 2.x the __str__ of all standard container types is ASCII-only (even if the elements have a non-ASCII __str__), and the __repr__ of standard objects is also ASCII-only as far as I can tell. ASCII-only is an option, and it is not uncommon and relatively safe (but it has its own issues of course).

It was exactly this unicode_literals property (turning everything into unicode) that helped to reveal bugs :) For example, models were trained on bytestrings under Python 2.x, and nobody remembers which encoding the training text was in. This went unnoticed for several years because, instead of raising an exception, functions just handled some edge cases (e.g. unicode punctuation) in a suboptimal way. This led to almost correct results, but with lower accuracy/precision/recall. After changing to "unicode everywhere" the issue became visible.
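A toy illustration of that kind of silent degradation, not NLTK's actual code: both helper functions below are hypothetical. Stripping punctuation from a bytestring with an ASCII-only table leaves Unicode quotation marks in place without any error, while the same step on decoded text can use Unicode categories:

```python
import string
import unicodedata


def strip_ascii_punct(token_bytes):
    # Only knows ASCII punctuation; anything else is silently left alone.
    return token_bytes.strip(string.punctuation.encode("ascii"))


def strip_unicode_punct(token):
    # Trim leading/trailing characters whose Unicode category is P* (punctuation).
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


# The word "hello" wrapped in curly quotes (U+201C, U+201D), UTF-8 encoded.
raw = b"\xe2\x80\x9chello\xe2\x80\x9d"

bytes_result = strip_ascii_punct(raw)                  # quotes survive, no error
text_result = strip_unicode_punct(raw.decode("utf-8"))  # quotes removed
```

Nothing fails in the bytes version; the token just stays different from the plain word, which is the "almost correct results" situation described above.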

The issue was not with the cancer-like turning of text into unicode; the issue was with code that works with text and doesn't support unicode. The Python 2.x standard library has such APIs, and this causes trouble, but I don't see how it is a bug in the code that works with text and returns unicode.

What I'm writing are common words and a standard "unicode mantra", but anyways.

We could say "programmers should just handle encodings properly, and unicode_literals have nothing to do with this", but this doesn't always work. "Unicode everywhere" makes some code changes necessary, but some of these changes reveal real bugs.

Another story: I took 2 different courses from 2 different top-notch universities at coursera.org where instructors gave us starter code (written in Python 2.x) for programming assignments. The code was not bad, but there were many cases of incorrect encoding handling in most of the provided files (errors that would be impossible in Python 3.x) - this was the code that was supposed to teach students something (including Python programming).

What I like about unicode_literals is that it makes things more consistent and easier (at least for me) to reason about: if a variable is unicode under 3.x, it is unicode under 2.x, and the same applies to bytestrings. In cases where different behaviour is necessary (e.g. because of a non-unicode API in the 2.x stdlib), an explicit str("foo") is used; otherwise code is written in Python 3.x and works with the same semantics under Python 2.x.

Just curious, what newcomer issues are you talking about, and who do you mean by "newcomers"?

> There is also advice in the article to encode __repr__ and __str__ results to utf8 under Python 2.x; this is fine (other approaches are not better), but it has some non-obvious consequences (like breaking the REPL in some setups) that developers should be aware of, see http://kmike.ru/python-with-strings-attached/

I don't see `__repr__` mentioned there, but `__repr__` should basically always be ascii (which a quick glance at your article looks like it mentions).
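A small sketch of an ASCII-only `__repr__`, assuming the object holds unicode text. The `Word` class is hypothetical, and `unicode_escape` does not escape quote characters, so this is an illustration rather than a drop-in `repr` replacement:

```python
class Word(object):
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        # Non-ASCII characters become \uXXXX escapes, so the result is pure ASCII.
        escaped = self.text.encode("unicode_escape").decode("ascii")
        # str(...) yields the native string type on both 2.x and 3.x.
        return str("Word('%s')" % escaped)


# Decoding a bytes literal keeps the example free of u"" literals.
w = Word(b"\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82".decode("utf-8"))
print(repr(w))
```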

I'm fine with `__str__` returning (encoding to) `utf-8` generally, as if someone wants something else they can always encode the unicode themselves to what they want, but `.encode(locale.getpreferredencoding())` is also fine with me if you want to be even more polite.

You're right that __repr__ was not mentioned, my bad.

I think `.encode(locale.getpreferredencoding())` is awful because this changes string encoding from run to run, and because locale.getpreferredencoding() could be different (and is different by default e.g. in Cyrillic Windows XP) from both `sys.stdout.encoding` (used for printing) and `sys.getdefaultencoding()` (used for implicit type conversions).
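To check whether the three encodings named above agree on a given machine, something like the following could be run. The helper name `report_encodings` is made up here; the three underlying calls are standard library:

```python
import locale
import sys


def report_encodings():
    """Collect the three encodings that can disagree with each other."""
    return {
        "locale.getpreferredencoding()": locale.getpreferredencoding(),
        # stdout may be replaced by an object without an `encoding` attribute.
        "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print("%s = %r" % (name, value))
```

The output depends on the OS, locale and terminal, so it is worth running in the actual environment where `__str__` results get printed.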

Good point. Honestly I'm careful about calling str on random objects which I know are doing this. But yeah, I guess that's probably a good enough reason to pick an encoding and go with it, and `utf-8` is as good a choice as any.