by Avernar 3209 days ago
I'm not a fan of how Python 3 stores Unicode strings internally. In my opinion they should have gone with UTF-8. The extra scanning and conversion puts more pressure on the processor and caches under load.

I agree that Python 2's Unicode handling is broken. That's why I just stored UTF-8 in a normal string and avoided the whole mess. The only thing I have to do is validate that any input from the outside world really is UTF-8.


Since the high-level API is supposed to let you treat a string as a sequence of code points, a correct implementation (which Python didn't have until 3.3!) would've imposed the overhead of conversion to something resembling a fixed-width encoding whenever a programmer invoked certain operations.

And the vast majority of strings in real-world Python contain only code points also present in latin-1, which means they can be stored in one byte per code point with this approach. And for strings which can't be stored in one byte per code point, you were similarly going to pay the price sooner or later.
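
To make the "one byte per code point" argument concrete, here is a rough sketch of the rule that a flexible representation like the one CPython adopted in 3.3 (PEP 393) uses to pick a storage width: it depends only on the highest code point in the string. The helper name `bytes_per_code_point` is made up for illustration and is not a real API.

```python
def bytes_per_code_point(s: str) -> int:
    """Hypothetical helper: width a PEP 393-style layout would pick for s."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:        # fits in Latin-1
        return 1
    if highest < 0x10000:      # fits in the Basic Multilingual Plane
        return 2
    return 4                   # anything beyond the BMP, e.g. most emoji

samples = ("hello", "caf\u00e9", "\u65e5\u672c\u8a9e", "hi \U0001F600")
widths = {text: bytes_per_code_point(text) for text in samples}
```

The whole string is stored at the width its widest code point needs, which is why the common Latin-1 case stays compact.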

> Since the high-level API is supposed to let you treat a string as a sequence of code points,

I disagree with that premise. It should operate on grapheme clusters. Operating on code points falls into the same trap as operating on bytes.
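
A small sketch of the trap being described, using only the standard library: one user-perceived character can span several code points, so `len()` and slicing on `str` can still split it. The sample strings are just illustrations.

```python
import unicodedata

decomposed = "e\u0301"                      # 'e' followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)

# Both render as a single "é", but they differ in code point count.
decomposed_len = len(decomposed)            # 2 code points
composed_len = len(composed)                # 1 code point

# A flag emoji is two regional-indicator code points but one grapheme cluster.
flag = "\U0001F1E9\U0001F1EA"
flag_len = len(flag)

# Slicing by code point can strip the accent off its base letter.
first_only = decomposed[:1]                 # just 'e', accent lost
```

The standard library has no grapheme cluster segmentation, so counting clusters properly would need a third-party package.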

> a correct implementation (which Python didn't have until 3.3!) would've imposed the overhead of conversion to something resembling a fixed-width encoding whenever a programmer invoked certain operations.

Those operations should have been removed. Indexing is the big one that needs fixed width internal representation for speed. Code could have been rewritten to not require indexing. But mechanical translation from Python 2 to 3 was a goal and because of that they couldn't radically change the unicode API for the better.

> And the vast majority of strings in real-world Python contain only code points also present in latin-1, which means they can be stored in one byte per code point with this approach. And for strings which can't be stored in one byte per code point, you were similarly going to pay the price sooner or later.

You're going to pay the price for 4 byte per codepoint strings quite often. A single emoji will blow up a latin-1 string to 4 times the size.
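
This is easy to measure, at least on CPython 3.3 or later. The sketch below assumes CPython, since `sys.getsizeof` reports implementation-specific sizes, and `marginal_bytes_per_char` is a made-up helper name. It compares two strings that differ only in how many `a`s they contain, so the fixed object header cancels out.

```python
import sys

def marginal_bytes_per_char(suffix: str, n: int = 1000) -> float:
    """Hypothetical helper: extra bytes per added 'a' when the string also holds suffix."""
    short = "a" * n + suffix
    longer = "a" * (2 * n) + suffix
    return (sys.getsizeof(longer) - sys.getsizeof(short)) / n

ascii_cost = marginal_bytes_per_char("")            # pure ASCII
bmp_cost = marginal_bytes_per_char("\u20ac")        # one euro sign in the string
emoji_cost = marginal_bytes_per_char("\U0001F600")  # one emoji in the string
```

On CPython the three values should come out as 1, 2 and 4 bytes per character respectively; other implementations may store strings differently.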

> That's why I just stored UTF-8 in a normal string and avoided the whole mess.

This only works if every library that you use agrees with you on this, and treats all strings you pass to it as UTF-8 whenever encoding matters.

OTOH, if you don't care about that, then you might as well just use bytes everywhere, and get the same thing. At least in Python 3, with bytes, if a library does try to use it as a string, you'll get an error, rather than silent wrong output.
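
A minimal illustration of the "you'll get an error" part in Python 3: mixing `bytes` and `str` in an operation raises `TypeError` instead of silently coercing. The helper `try_concat` exists only for this example.

```python
def try_concat(left, right):
    """Illustrative helper: attempt left + right and report a TypeError as text."""
    try:
        return left + right
    except TypeError as exc:
        return f"TypeError: {exc}"

utf8_bytes = "caf\u00e9".encode("utf-8")

mixed = try_concat(utf8_bytes, " au lait")      # bytes + str is rejected
same_type = try_concat(utf8_bytes, b" au lait") # bytes + bytes is fine
```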

> This only works if every library that you use agrees with you on this, and treats all strings you pass to it as UTF-8 whenever encoding matters.

Not exactly. The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).

Only ran into this issue once and the library had an option to return everything as string so not a problem.

> At least in Python 3, with bytes, if a library does try to use it as a string, you'll get an error, rather than silent wrong output.

Bytes in Python 3 don't support string operators.

> Bytes in Python 3 don't support string operators.

Slight nitpick: `bytes` objects in Python 3 do not share all of the operations and methods available on `str`, but do share quite a few. Notably, `bytes` will never implement format(), but it does implement printf()-style formatting via the modulo operator.

The `bytes` and `bytearray` types implement the following methods which also exist on `str` (in some cases, with the caveat that the operation only makes sense if the bytes in question are in the ASCII range):

capitalize(), center(), count(), endswith(), expandtabs(), find(), index(), isalnum(), isalpha(), isdigit(), islower(), isspace(), istitle(), isupper(), join(), ljust(), lower(), lstrip(), maketrans(), partition(), replace(), rfind(), rindex(), rjust(), rpartition(), rsplit(), rstrip(), split(), splitlines(), startswith(), strip(), swapcase(), title(), translate(), upper(), zfill()
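
A quick sketch of both points, assuming Python 3.5 or later for the `%` operator on `bytes`: the str-like methods only touch ASCII bytes, and printf-style formatting works while `format()` is absent.

```python
raw = "caf\u00e9".encode("utf-8")            # b'caf\xc3\xa9'

# upper() on bytes changes only ASCII letters; the two bytes of 'é' are untouched.
upper_bytes = raw.upper()                     # b'CAF\xc3\xa9'

# Decoding first gives real Unicode case mapping.
upper_text = raw.decode("utf-8").upper()      # 'CAFÉ'

# printf-style formatting via the modulo operator (Python 3.5+).
formatted = b"%s has %d bytes" % (raw, len(raw))

# There is no bytes.format() method.
has_format = hasattr(bytes, "format")
```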

I didn't realize the modulo operator for bytes was added. Most information I've run across said it didn't work.

Unfortunately most libraries for 3 will be using str so using bytes with UTF-8 inside will become more and more difficult.

> I didn't realize the modulo operator for bytes was added. most information I've run across said it didn't work.

It was added in Python 3.5 (IIRC that's the last backwards compatibility feature added, I don't remember 3.6 adding any, or any being planned for 3.7).

> The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).

If I pass a library a string it receives a Unicode string, bytes already decoded using an encoding. It shouldn't be able to re-decode that in any way, whatever that is supposed to mean on a technical level.

If a library receives a byte-array representing text, that is a completely different matter and talking about encodings is fully appropriate, even required.

But this matter should predominantly exist at your application's barrier, when doing IO.

If you're regularly doing encoding and decoding anywhere else, you're doing something wrong (or your language is).

Look back a few posts. We're discussing using UTF-8 in str and avoiding the unicode type in Python 2.

In my use case I validate the string as UTF-8 when it comes in from the internet. To and from the database is UTF-8 so no validation is required there. Output back to the internet requires no additional steps.
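
For reference, validating at the boundary can be as small as the sketch below. It is written against Python 3 `bytes` rather than the Python 2 `str` under discussion, and `is_valid_utf8` is a made-up name; the decoded result is thrown away because only the validity check matters here.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")      # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True

ok = is_valid_utf8("caf\u00e9".encode("utf-8"))   # proper UTF-8
latin1 = is_valid_utf8(b"caf\xe9")                # Latin-1 bytes, not UTF-8
truncated = is_valid_utf8(b"caf\xc3")             # multi-byte sequence cut short
```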

Nowhere in this method is encode or decode required or desired.

> Not exactly. The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).

Or do anything else that implies encoding. Like measure length, index, slice, change case etc.
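
A short sketch of why length and slicing imply an encoding, using Python 3 `bytes` to stand in for a UTF-8 byte string: byte counts and code point counts diverge as soon as there is a non-ASCII character, and a byte-offset slice can land in the middle of one.

```python
text = "na\u00efve caf\u00e9"
raw = text.encode("utf-8")

code_points = len(text)     # 10
byte_count = len(raw)       # 12, since each accented letter takes two bytes

# Slicing at byte offset 3 cuts the two-byte 'ï' in half.
broken = raw[:3]
try:
    broken.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False
```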

Change case, yes, that would require actually decoding the string to the unicode type. But that could be done when needed and not every time something from my database needs to go out to the client.

Slicing works fine on a UTF-8 string because I'm slicing at ASCII characters, and the bytes of an ASCII character never appear inside the encoding of a non-ASCII character. If I needed to slice between certain code points it would still be easy, as I'd just look for the appropriate 2-4 byte sequence and slice before or after it. Python doesn't support graphemes so I can't do much with those.
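
This relies on a real property of UTF-8: every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter can never match inside one. A sketch with Python 3 `bytes` and invented sample data:

```python
raw = "cl\u00e9=valeur;na\u00efve=caf\u00e9".encode("utf-8")

# Split on ASCII delimiters directly in the encoded bytes.
pairs = [item.split(b"=", 1) for item in raw.split(b";")]

# Every piece is still valid UTF-8, so decoding later (if ever) is safe.
decoded = {key.decode("utf-8"): value.decode("utf-8") for key, value in pairs}

# Slicing around a specific code point: search for its encoded byte sequence.
needle = "\u00ef".encode("utf-8")     # the two bytes of 'ï'
cut = raw.find(needle)
before = raw[:cut].decode("utf-8")    # everything up to, not including, 'ï'
```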

Measuring length is not something that comes up for me. And indexing to an absolute spot in a string never comes up at all.

But yes, if I did have to call a text processing library I'd have to then encode/decode to the Unicode type. But that's rare enough that I can keep everything UTF-8.