by ptx 4107 days ago

The problem is that the Windows commandline, legacy Windows programs and modern Unix systems all use different encodings, so any particular string of bytes (representing non-ASCII text) will only be correct on one of them.

For example, let's say our Other Country is a Western European country. The encoding for non-Unicode Win32 programs will be Windows-1252 (more or less ISO 8859-1) and the encoding for MS-DOS programs and the commandline will be codepage 850.

In this scenario, this Python 2 program (saved as UTF-8):

  #-*- coding: utf-8 -*-
  print "ångström"

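will print the wrong thing, because the byte string is written out as raw UTF-8 bytes and each consumer interprets those bytes in its own encoding: you get "├Ñngstr├Âm" if you run it from the commandline, and "Ã¥ngstrÃ¶m" in a more Windowsy context (e.g. if you're writing it to a file and reading it in Notepad).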
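You can reproduce both garbled results on any system by misreading the bytes on purpose. This is just a rough Python 3 sketch: the variable names are made up for illustration, and "cp850" and "cp1252" are Python's codec names for the commandline code page and the Western European Windows code page.

```python
# -*- coding: utf-8 -*-
text = "ångström"

# These are the bytes a UTF-8 source file holds for the string literal.
utf8_bytes = text.encode("utf-8")

# Misread the same bytes the way each environment would.
as_console = utf8_bytes.decode("cp850")    # commandline / MS-DOS programs
as_windows = utf8_bytes.decode("cp1252")   # non-Unicode Win32 programs

print(as_console)  # ├Ñngstr├Âm
print(as_windows)  # Ã¥ngstrÃ¶m
```

The bytes never change; only the table used to interpret them does.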

To make it correct, you can apply the Unicode sandwich approach:

1) Know the input encoding and decode from that to Unicode.

2) Process the text as Unicode.

3) Know the output encoding and encode into that encoding on output.
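The three steps might look something like this in Python 3. It's only a sketch: `transcode` is a hypothetical helper, and upper-casing is just a stand-in for whatever processing you actually need in step 2.

```python
def transcode(raw, input_encoding, output_encoding):
    """Hypothetical helper showing the three sandwich steps."""
    text = raw.decode(input_encoding)     # 1) bytes -> Unicode
    text = text.upper()                   # 2) process as Unicode (placeholder)
    return text.encode(output_encoding)   # 3) Unicode -> bytes

utf8_input = "ångström".encode("utf-8")

# The same input, encoded correctly for two different consumers.
console_bytes = transcode(utf8_input, "utf-8", "cp850")
windows_bytes = transcode(utf8_input, "utf-8", "cp1252")

print(console_bytes)  # b'\x8fNGSTR\x99M'
print(windows_bytes)  # b'\xc5NGSTR\xd6M'
```

The output bytes differ, but each consumer decodes its own version back to "ÅNGSTRÖM".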

In other words, making the literal a Unicode string lets Python decode it from whatever encoding you saved the source file in and, on output, encode it into whatever encoding your terminal happens to use, so this program will always (if the system is correctly configured) print the right thing:

  #-*- coding: utf-8 -*-
  print u"ångström"

In Python 3, UTF-8 source encoding and Unicode strings are the default, so the correct program becomes simply:

  print("ångström")
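If you want to see the encode-on-output step without switching terminals, you can fake one with an in-memory stream. Again only a sketch: `print_to_fake_terminal` is a made-up name, and a real console's encoding is whatever `sys.stdout.encoding` reports, not something you normally pass in yourself.

```python
import io

def print_to_fake_terminal(text, terminal_encoding):
    """Return the bytes print() would send to a terminal using this encoding."""
    raw = io.BytesIO()
    # newline="\n" turns off platform-specific line-ending translation.
    terminal = io.TextIOWrapper(raw, encoding=terminal_encoding, newline="\n")
    print(text, file=terminal)
    terminal.flush()
    return raw.getvalue()

cp850_bytes = print_to_fake_terminal("ångström", "cp850")
utf8_bytes = print_to_fake_terminal("ångström", "utf-8")

print(cp850_bytes)  # b'\x86ngstr\x94m\n'
print(utf8_bytes)   # b'\xc3\xa5ngstr\xc3\xb6m\n'
```

The same Unicode string comes out as different bytes depending on the stream's encoding, which is why the output is right in both places.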