| Great writeup (and a cool metaclass workaround!). However, I think the "Drop 2.5, 3.1 and 3.2" advice is bad: dropping 2.5 and 3.1 is the way to go (hey, drop 2.5 even if you're not porting to 3.x), but dropping 3.2 is not necessary in most cases. In my experience (porting and maintaining 20 open-source packages that work with Python 2 and Python 3 using a single codebase, including NLTK) Python 3.2 has never been a problem - I don't see how the NLTK code and the code of my other packages could be improved by dropping Python 3.2 compatibility.
The main argument for dropping Python 3.2 support seems to be that u'strings' are not supported in Python 3.2. There are 3 "types" of strings in Python:
* b"bytes"
* "native strings" # bytes in 2.x and unicode in 3.x
* u"unicode"
By adding a `from __future__ import unicode_literals` line to the top of the file, code compatible with 2.6-3.2 can be written like this:
* b"bytes"
* str("native string") # bytes in 2.x and unicode in 3.x
* "unicode"
In my opinion this is not a hack (unlike six.b and six.u, which are necessary for 2.5 support), and it is arguably closer to Python 3.x semantics (unicode strings are the default). So IMHO while using the u"unicode" feature from Python 3.3 makes porting somewhat easier (less stupid search-replace), it could also make code worse and more cluttered, and Python 3.2-compatible syntax is just fine.
It is true that 3.3 brings other improvements (Armin mentioned binary codecs), but it is quite rare that a library actually needs them (even libraries as big as NLTK and Django are fine with the 3.2 stdlib). 3.2 is the default 3.x Python in the current Ubuntu LTS (EOL in 2017) and the default 3.x Python in the recently released Debian Wheezy; 3.2 will be around for a long time, and not supporting it will hurt. So if you're doing Python 3.x porting, please just fix those stupid u'strings' with the unicode_literals future import - your code will be more idiomatic and also 3.2 compatible.
The article also advises encoding __repr__ and __str__ results to utf8 under Python 2.x; this is fine (other approaches are not better), but it has some non-obvious consequences (like breaking the REPL in some setups) that developers should be aware of, see http://kmike.ru/python-with-strings-attached/ For lower-level 2.x-3.x compatible C/C++ extensions, Cython is great. In fact, many libraries (e.g. lxml) are compatible with Python 3.x because they are written in Cython, which generates compatible code (modulo library changes) by default. |
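The three spellings proposed in the comment above can be checked with a short sketch. The variable names (`raw`, `native`, `text`, `text_type`) are made up for illustration and do not come from NLTK or any other package mentioned here; the only assumption is that the module starts with the future import.

```python
from __future__ import unicode_literals  # must come before any other statement

# Illustrative names only; none of these come from the libraries discussed.
raw = b"bytes"                   # bytes on 2.6+ and on 3.x
native = str("native string")    # bytes on 2.x, unicode on 3.x
text = "unicode"                 # unicode on both, thanks to the future import

# type("") is `unicode` on 2.x (because of the import) and `str` on 3.x,
# so it works as a portable name for "the text type".
text_type = type("")

print(type(raw).__name__, type(native).__name__, type(text).__name__)
```

On 2.x the future import is what makes the bare literal `"unicode"` a text string; on 3.x it has no effect, which is why the same file also runs on 3.2.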
It's not a problem if you are willing to litter your code with str() calls or accidentally upgrade a ton of 2.x code to unicode. There are just too many cases in 2.x where that is a terrible idea and introduces subtle bugs. I very strongly recommend against `from __future__ import unicode_literals`. If anything, go with six.
With regard to supporting 3.2: I don't think anyone cares. The number of people currently using Python 3 is pretty low, and a lot of libraries are already dropping 3.2 support. Requests, MarkupSafe and Jinja2 have all dropped 3.2 support, and with that a lot of stuff that depends on them will now also require 3.3.
I still think people should stick to 2.7 for at least another year or two, and by that point a lot will have changed.
//EDIT: wrt __str__ returning utf-8 data: the encoding of __str__'s result is undefined, but it is usually accepted that it can be more than plain ASCII. Django and Jinja2, for instance, returned utf-8 there for years and it did not cause any problems.
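For readers who have not seen the pattern being debated, here is a minimal sketch of "encode __str__ to utf-8 on 2.x". The decorator name `utf8_str_on_py2` and the `Greeting` class are hypothetical; this is not the actual Django or Jinja2 code. It assumes the class's `__str__` returns text (unicode) on both major versions.

```python
import sys

PY2 = sys.version_info[0] == 2


def utf8_str_on_py2(cls):
    """Hypothetical class decorator: on 2.x, keep the text-returning method
    as __unicode__ and make __str__ return its utf-8 encoded bytes.
    On 3.x the class is returned unchanged."""
    if PY2:
        cls.__unicode__ = cls.__str__
        cls.__str__ = lambda self: self.__unicode__().encode("utf-8")
    return cls


@utf8_str_on_py2
class Greeting(object):
    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Assumed to produce text; on 2.x this requires self.name to be unicode.
        return "Hello, %s" % self.name


print(str(Greeting("World")))
```

The REPL caveat raised earlier applies to exactly this kind of code: on 2.x the bytes returned by `__str__` are utf-8 regardless of what encoding the terminal expects.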