Hacker News
by dr_zoidberg 4099 days ago
I'm from "another country" and I always assumed that any characters Python couldn't print on screen were the fault of cmd.exe (and PowerShell) for not handling Unicode correctly, not a Python "error" per se.

Also, all my Python sources are set to UTF-8 and I never had any problem in Windows. Notepad.exe gives you the encoding option when you save a file, and every sensible text editor/IDE gives you encoding and line feed options.

So what would be the problem with Zed's tip? Have you ever tried to run a Python script with special characters? The interpreter dies instantly with an encoding error. It's easier to set the encoding to UTF-8 and get the program running than to go through the whole thing checking whether you used a special character in the comments -- which shouldn't affect program execution, but hey! Also, this way you can write meaningful comments in your native language without worrying about whether it'll kill the interpreter right away.

2 comments

The problem is that the Windows commandline, legacy Windows programs and modern Unix systems all use different encodings, so any particular string of bytes (representing non-ASCII text) will only be correct on one of them.

For example, let's say our Other Country is a Western European country. The encoding for non-Unicode Win32 programs will be Windows-1252 (more or less ISO 8859-1) and the encoding for MS-DOS programs and the commandline will be codepage 850.
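To make the disagreement concrete, here is a small Python 3 sketch that encodes the same character under each of the three encodings. The codec names (cp1252 for the Western European Windows code page, cp850, utf-8) are Python's standard ones; the variable names are just for illustration.

```python
# Encode one non-ASCII character under the three encodings in play.
text = "å"
encoded = {name: text.encode(name) for name in ("cp1252", "cp850", "utf-8")}

# Show the bytes each encoding produces for the same character.
for name, raw in encoded.items():
    print(name, raw.hex())
```

Each encoding yields a different byte sequence for the same letter, which is why bytes written for one environment look wrong in another.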

In this scenario, this Python 2 program (saved as UTF-8):

  #-*- coding: utf-8 -*-
  print "ångström"
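will print the wrong thing: "├Ñngstr├Âm" if you run it from the commandline, and "Ã¥ngstrÃ¶m" in a more Windowsy context (e.g. if you're writing it to a file and reading it in Notepad). The plain string literal holds the raw UTF-8 bytes from the source file, and Python 2 writes them out unchanged, so each environment interprets them in its own encoding.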
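You can reproduce both garbled outputs from Python 3 by taking the UTF-8 bytes and decoding them the way each environment would. This is only a sketch of the misinterpretation, assuming the cp850 and cp1252 code pages from the Western European example; it doesn't touch the real console.

```python
# The UTF-8 bytes that the Python 2 script writes out unchanged.
raw = "ångström".encode("utf-8")

# What a cp850 console and a cp1252 program make of those same bytes.
as_console = raw.decode("cp850")
as_windows = raw.decode("cp1252")

print(as_console)
print(as_windows)
```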

To make it correct, you can apply the Unicode sandwich approach:

1) Know the input encoding and decode from that to Unicode.

2) Process the text as Unicode.

3) Know the output encoding and encode into that encoding on output.
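A minimal Python 3 sketch of those three steps, assuming UTF-8 input and a cp850 console. The helper name `shout` and the choice of upper-casing as the "processing" step are made up for illustration.

```python
def shout(raw, input_encoding, output_encoding):
    """Hypothetical helper: decode, process as text, encode."""
    text = raw.decode(input_encoding)      # 1) bytes -> Unicode
    text = text.upper()                    # 2) work on Unicode text
    return text.encode(output_encoding)    # 3) Unicode -> bytes


utf8_input = "ångström".encode("utf-8")
console_bytes = shout(utf8_input, "utf-8", "cp850")

# Decode again only to display what a cp850 console would show.
print(console_bytes.decode("cp850"))
```

The processing step only works because it runs on decoded text; calling `upper()` on the raw bytes would leave the non-ASCII letters untouched.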

In other words, making the literal a Unicode string lets Python decode it from whatever encoding you saved the file in and then encode it to whatever encoding your terminal happens to use, so this program will always (if the system is correctly configured) print the right thing:

  #-*- coding: utf-8 -*-
  print u"ångström"
In Python 3, UTF-8 source encoding and Unicode strings are the default, so the correct program becomes simply:

  print("ångström")
The problem isn't extended characters in your Python script; it's how your Python script handles extended character data. Scripts written in Python 2 that ignore the existence of Unicode won't always do the right thing when they encounter non-ASCII strings in the wild.