

by int_19h 3213 days ago

> That's why I just stored UTF-8 in a normal string and avoided the whole mess.

This only works if every library that you use agrees with you on this, and treats all strings you pass to it as UTF-8 whenever encoding matters.

OTOH, if you don't care about that, then you might as well just use bytes everywhere, and get the same thing. At least in Python 3, with bytes, if a library does try to use it as a string, you'll get an error, rather than silent wrong output.
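A minimal sketch of that failure mode, assuming Python 3. The function `greet` is a made-up stand-in for a library routine that treats its argument as text; it is not from any real library.

```python
def greet(name):
    # Hypothetical library function that assumes `name` is a str.
    return "Hello, " + name

# Passing str works as expected.
text_result = greet("Zoë")

# Passing UTF-8 bytes fails loudly instead of silently producing garbage.
try:
    greet("Zoë".encode("utf-8"))
    raised = False
except TypeError:
    raised = True
```

In Python 2 the equivalent mix of `str` and `unicode` would instead trigger an implicit ASCII decode, which is where the silent (or late) breakage tends to come from.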


Avernar 3213 days ago

> This only works if every library that you use agrees with you on this, and treats all strings you pass to it as UTF-8 whenever encoding matters.

Not exactly. The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).

Only ran into this issue once and the library had an option to return everything as string so not a problem.

> At least in Python 3, with bytes, if a library does try to use it as a string, you'll get an error, rather than silent wrong output.

Bytes in Python 3 don't support string operators.

ubernostrum 3213 days ago

> Bytes in Python 3 don't support string operators.

Slight nitpick: `bytes` objects in Python 3 do not share all of the operations and methods available on `str`, but do share quite a few. Notably, `bytes` will never implement format(), but it does implement printf()-style formatting via the modulo operator.

The `bytes` and `bytearray` types implement the following methods which also exist on `str` (in some cases, with the caveat that the operation only makes sense if the bytes in question are in the ASCII range):

capitalize(), center(), count(), endswith(), expandtabs(), find(), index(), isalnum(), isalpha(), isdigit(), islower(), isspace(), istitle(), isupper(), join(), ljust(), lower(), lstrip(), maketrans(), partition(), replace(), rfind(), rindex(), rjust(), rpartition(), rsplit(), rstrip(), split(), splitlines(), startswith(), strip(), swapcase(), title(), translate(), upper(), zfill()
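A quick illustration of the above, assuming Python 3.5 or later (the sample string is arbitrary):

```python
data = "naïve café".encode("utf-8")

# Methods shared with str operate on the raw bytes.
parts = data.split(b" ")
starts = data.startswith(b"na")

# Case methods only change ASCII bytes; non-ASCII bytes pass through untouched.
upper = data.upper()

# printf-style formatting via the modulo operator (Python 3.5+).
line = b"%s has %d bytes" % (data, len(data))

# There is no bytes.format() method.
has_format = hasattr(data, "format")
```

Note that `upper` decodes to `NAïVE CAFé`: the accented letters are left alone, which is the "only makes sense in the ASCII range" caveat in practice.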

Avernar 3213 days ago

I didn't realize the modulo operator for bytes was added. Most information I've run across said it didn't work.

Unfortunately, most libraries for Python 3 will be using str, so using bytes with UTF-8 inside will become more and more difficult.

masklinn 3212 days ago

> I didn't realize the modulo operator for bytes was added. most information I've run across said it didn't work.

It was added in Python 3.5 (IIRC that's the last backwards compatibility feature added, I don't remember 3.6 adding any, or any being planned for 3.7).

josteink 3213 days ago

> The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).

If I pass a library a string it receives a Unicode string, bytes already decoded using an encoding. It shouldn't be able to re-decode that in any way, whatever that is supposed to mean on a technical level.

If a library receives a byte-array representing text, that is a completely different matter and talking about encodings is fully appropriate, even required.

But this matter should predominantly exist at your application's barrier, when doing IO.

If you're regularly doing encoding and decoding anywhere else, you're doing something wrong (or your language is).
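A rough sketch of what "decode at the barrier" might look like in Python 3. The helper names `read_text` and `write_text` are invented for illustration, and `io.BytesIO` stands in for a socket or file:

```python
import io

def read_text(stream, encoding="utf-8"):
    # Decode exactly once, on the way in.
    return stream.read().decode(encoding)

def write_text(stream, text, encoding="utf-8"):
    # Encode exactly once, on the way out.
    stream.write(text.encode(encoding))

incoming = io.BytesIO("café".encode("utf-8"))
text = read_text(incoming)

# Everything in between works on str and never mentions an encoding.
shouted = text.upper()

outgoing = io.BytesIO()
write_text(outgoing, shouted)
result_bytes = outgoing.getvalue()
```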

Avernar 3213 days ago

Look back a few posts. We're discussing using UTF-8 in str and avoiding the unicode type in Python 2.

In my use case I validate the string as UTF-8 when it comes in from the internet. To and from the database is UTF-8, so no validation is required there. Output back to the internet requires no additional steps.

Nowhere in this method is encode or decode required or desired.

int_19h 3212 days ago

> Not exactly. The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).

Or do anything else that implies encoding. Like measure length, index, slice, change case etc.
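To make that concrete, here is a small comparison, written for Python 3 with `bytes` standing in for a Python 2 `str` holding UTF-8 (the sample word is arbitrary):

```python
text = "héllo"
raw = text.encode("utf-8")  # b"h\xc3\xa9llo"

# Length: code points vs. bytes.
char_count = len(text)
byte_count = len(raw)

# Indexing: a whole character vs. half of one.
second_char = text[1]
second_byte = raw[1:2]

# Slicing at an arbitrary byte offset can cut a character in two.
try:
    raw[:2].decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

# Case changes: bytes only know about ASCII letters.
upper_text = text.upper()
upper_raw = raw.upper()
```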

Avernar 3211 days ago

Change case, yes, that would require actually decoding the string to the unicode type. But that could be done when needed and not every time something from my database needs to go out to the client.

Slicing works fine on a UTF-8 string, as I'm slicing between ASCII characters, which never appear inside the encoding of a non-ASCII character. If I needed to slice between certain code points it would still be easy, as I just look for the appropriate 2-4 byte sequence and slice before or after it. Python doesn't support graphemes, so I can't do much with those.
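A sketch of why this holds, again using Python 3 `bytes` as a stand-in for a Python 2 `str` holding UTF-8. The record format here is made up. In UTF-8 every byte of a multi-byte character has its high bit set, so an ASCII delimiter can never match in the middle of one:

```python
record = "名前=José;city=Zürich".encode("utf-8")

# Splitting on ASCII delimiters never lands inside a multi-byte character.
fields = record.split(b";")
pairs = dict(field.split(b"=", 1) for field in fields)

# Every piece is still valid UTF-8 (decoding here only to demonstrate that).
decoded = {k.decode("utf-8"): v.decode("utf-8") for k, v in pairs.items()}

# Slicing around a specific code point: search for its encoded byte sequence.
needle = "é".encode("utf-8")  # a 2-byte sequence
idx = record.find(needle)
before = record[:idx]
after = record[idx + len(needle):]
```

This relies on the input already being valid UTF-8, which matches the "validate once on the way in" approach described earlier in the thread.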

Measuring length is not something that comes up for me. And indexing to an absolute spot in a string never comes up at all.

But yes, if I did have to call a text processing library I'd have to then encode/decode to the Unicode type. But that's rare enough that I can keep everything UTF-8.
