by adamtj 3982 days ago

The fix isn't quite right. It may technically produce correct output now, but it's sloppy. The sloppy code is brittle and dangerous and perfect food for bugs, but that's a minor problem. After all, a single mistake or a small bit of sloppy code can only cause a few bugs at most. The major problem is that the sloppiness indicates a possible lack of understanding. A misunderstanding can continue to produce bugs and brittle code indefinitely. Misunderstandings are the devil!

The symptom is that the .encode() comes too early. The general principle is to .decode() as early as possible and to .encode() as late as possible. The results of .encode() should be as temporary as possible -- preferably never even assigned to a variable.
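A minimal sketch of that principle, assuming Python 3 and UTF-8 input. The helper name `shout` and the in-memory streams are made up for illustration; they are not from the article.

```python
import io

def shout(raw_in, raw_out, encoding="utf-8"):
    """Hypothetical helper: copy a byte stream, upper-casing its text."""
    text = raw_in.read().decode(encoding)   # decode right at the input boundary
    result = text.upper()                   # all real work happens on str
    raw_out.write(result.encode(encoding))  # encode while writing; the bytes are never stored

source = io.BytesIO("café\n".encode("utf-8"))  # stands in for a file or socket
sink = io.BytesIO()
shout(source, sink)
```

Note that the result of .encode() goes straight into write() and never gets a name.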

Seeing the encode in the wrong place leads me to suspect that the author is confusing byte arrays and strings. These are two distinct things, but most documentation makes that distinction clear as mud.

The key thing to realize is that strings are not bytes, and bytes are neither characters nor strings. Think of strings as abstract data structures, like hash tables or linked lists. Bytes are binary integers. On the surface, byte arrays are integer arrays, not strings nor hash tables nor lists of objects.
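Python 3 makes the distinction visible, so here is a small sketch of it (the sample word is arbitrary):

```python
text = "naïve"               # a str: a sequence of characters
data = text.encode("utf-8")  # a bytes object: a sequence of 8-bit integers

first_char = text[0]         # indexing a str gives a one-character str
first_byte = data[0]         # indexing bytes gives an int
as_ints = list(data)         # the same bytes, viewed as plain integers
same = (text == data)        # a str does not compare equal to bytes in Python 3
```

Here `as_ints` has six elements for five characters, because "ï" takes two bytes in UTF-8.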

Programs interface with the world via bytes. Files are bytes. The network is bytes. Everything is bytes. Bytes are not strings. A byte is an 8-bit integer. When you do I/O and get bytes from the world, you must deserialize them into whatever abstract data structure they represent. Ignore the C language and its misnamed "char" type. A string is an abstract data structure, as is a hash table, or a list. In some sense even binary integers are abstract data structures that need to be serialized. String serializations are called "encodings". Binary integers are serialized by choosing a byte order (big or little endian). There are various standard ways to serialize hash tables and lists, like JSON, various XML formats, Python's "pickle" and "shelve", etc.
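To make the "everything is a serialization" point concrete, here is a rough Python 3 sketch using only the standard library. The sample values are arbitrary.

```python
import json

# A string serialized two ways: the same character, different bytes.
utf8 = "é".encode("utf-8")
latin1 = "é".encode("latin-1")

# An integer serialized two ways: the same number, different byte order.
big = (258).to_bytes(2, "big")
little = (258).to_bytes(2, "little")

# A dict serialized as JSON text, which is then encoded to bytes for I/O.
record = {"author": "adamtj", "count": 1}
wire = json.dumps(record).encode("utf-8")

# Deserializing reverses each step.
restored = json.loads(wire.decode("utf-8"))
number = int.from_bytes(big, "big")
```

In every case the bytes only mean something once you know which serialization produced them.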

When you get bytes from the network or a file and those bytes are supposed to represent a string, you must deserialize the bytes into a string object. This is called decoding. Often you're using a web framework or other library that does this for you. Python 3's file objects do it. If it's not done automatically, then you must do it yourself. You or your framework should decode bytes into a unicode string object as soon as possible. You should do this everywhere that you do input, and then leave your strings as strings for as long as possible. Do all of your operations on strings ("unicodes"), not bytes. You parse strings, join strings, replace characters in strings, trim, find lengths and match regexes on strings ("unicodes"). Doing any of those operations on byte arrays is nonsensical and will lead to bugs. Only when you have your final string completely ready to go should you worry about serializing it for printing or to write to a file or the network. Only then, at the last possible moment, should thoughts like utf-8 or ascii enter your mind.
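A quick sketch of what goes wrong when those operations are done on bytes instead, assuming Python 3 and UTF-8 data (the sample text is invented):

```python
raw = "héllo wörld".encode("utf-8")  # what arrives from a file or socket
text = raw.decode("utf-8")           # decoded as early as possible

byte_len = len(raw)    # counts bytes
char_len = len(text)   # counts characters

text_prefix = text[:2]   # slicing a str keeps characters whole
byte_prefix = raw[:2]    # slicing bytes can cut a character in half

try:
    byte_prefix.decode("utf-8")
    prefix_ok = True
except UnicodeDecodeError:
    prefix_ok = False     # the two-byte "é" was split down the middle

upper_text = text.upper()   # case mapping on str handles é and ö
upper_raw = raw.upper()     # bytes.upper() only touches ASCII letters
```

The byte versions don't crash right away. They quietly give different answers, which is worse.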

As written, it's unclear whether "freqs" contains byte arrays or unicode strings. Getting that mixed up can result in failing to find an item which really is in the dict, or miscounting frequencies, or it can even cause more UnicodeEncode/DecodeErrors. By decoding as early as possible and encoding as late as possible, such sneaky bugs are much less likely to occur.
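Here is a sketch of that failure mode. It assumes Python 3, where bytes and str keys never compare equal. (Python 2 behaves differently for pure-ASCII values, which is part of what makes the bug sneaky there.) The name and the `to_text` helper are hypothetical.

```python
from collections import Counter

name = "José"
as_text = name                   # decoded string
as_bytes = name.encode("utf-8")  # the same name, still undecoded

# Mixed keys: one author is counted under two different keys.
mixed = Counter([as_text, as_bytes, as_text])

# A lookup with the wrong type misses an item that "really is" there.
freqs = {as_bytes: 1}
missed = freqs.get(as_text, 0)

# Decoding at the input boundary keeps every key a str.
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes, pass str through unchanged."""
    return value.decode(encoding) if isinstance(value, bytes) else value

clean = Counter(to_text(v) for v in [as_text, as_bytes, as_text])
```

With mixed keys the counter ends up with two entries for one author; after decoding it has one entry with the right count.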

In Python 2, I would have fixed the problem like this:

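  for e in results:
      # Keep what follows the first '(' minus the last character (presumably the closing ')').
      simple_author = e['author'].split('(')[1][:-1].strip()
      if freqs.get(simple_author, 0) < 1:
          # Format the full line first; .encode() is applied only to the finished result.
          print ("%(date)s -- %(author)s -- %(title)s" % {
              'date':   parse(e['published']).strftime("%Y-%m-%d"),
              'author': simple_author,
              'title':  e['title'],
          }).encode('utf-8')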

You might dislike my multi-line print and would prefer to .join() a list, possibly with a temporary variable. Or, maybe you'd prefer the newer .format(). Regardless, the important point is that the .encode() should happen later than it does in the article.
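For comparison, a rough Python 3 sketch of the same idea using .format(). This is not the article's code: the entry below is made up, the "address (Name)" shape of the author field is an assumption, and datetime.fromisoformat stands in for whatever parse() is in the original.

```python
import io
from datetime import datetime

def format_entry(entry):
    """Return one output line as a str; no encoding happens here."""
    simple_author = entry["author"].split("(")[1][:-1].strip()
    return "{date} -- {author} -- {title}".format(
        date=datetime.fromisoformat(entry["published"]).strftime("%Y-%m-%d"),
        author=simple_author,
        title=entry["title"],
    )

# Made-up entry for illustration only.
entry = {
    "author": "jose@example.com (José)",
    "published": "2014-03-01T12:00:00",
    "title": "Unicode in Python",
}

out = io.BytesIO()  # stands in for sys.stdout.buffer or a socket
out.write((format_entry(entry) + "\n").encode("utf-8"))  # encode only while writing
```

In practice Python 3's print() and text-mode files do that last encode for you, which is the same rule enforced by the library.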