by Avernar
3209 days ago
I'm not a fan of how Python 3 stores Unicode strings internally. In my opinion they should have gone with UTF-8. The extra scanning and conversion puts more pressure on the processor and caches under load. I agree that Python 2's Unicode handling is broken. That's why I just stored UTF-8 in a normal string and avoided the whole mess. The only thing I have to do is validate that any input from the outside world is really UTF-8.
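For illustration, the validation step mentioned above might look something like the sketch below. It is written for Python 3 rather than the Python 2 setup described, and `is_valid_utf8` is a hypothetical helper name, not something from the comment. It relies only on the strict error handling that `bytes.decode` uses by default.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        # The default errors="strict" raises on any malformed sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8("héllo".encode("utf-8")))  # well-formed input
    print(is_valid_utf8(b"\xc3\x28"))              # lead byte without a valid continuation
```

Once input has passed a check like this at the boundary, the bytes can be passed around unchanged, which is the appeal of the store-UTF-8-everywhere approach.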
|
And the vast majority of strings in real-world Python contain only code points also present in latin-1, which means they can be stored in one byte per code point with this approach. And for strings which can't be stored in one byte per code point, you were similarly going to pay the price sooner or later.
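The one-byte-per-code-point claim can be checked roughly on CPython 3.3 or later, where strings use the flexible representation from PEP 393. The sketch below is only an approximation that depends on CPython implementation details: `bytes_per_code_point` is a hypothetical helper, and it estimates the per-character cost by comparing `sys.getsizeof` for a short and a long string made of the same character, so the fixed object header cancels out.

```python
import sys


def bytes_per_code_point(ch: str, n: int = 1000) -> int:
    """Estimate storage per code point for a string made only of ch.

    Hypothetical helper; assumes CPython's PEP 393 string layout.
    """
    # Both strings have the same header size, so the difference is
    # the cost of the n extra copies of ch.
    extra = sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)
    return extra // (n * len(ch))


if __name__ == "__main__":
    for ch in ("a", "é", "€", "😀"):
        print(repr(ch), bytes_per_code_point(ch))
```

On CPython this should show one byte per code point for ASCII and other Latin-1 characters, two for the rest of the Basic Multilingual Plane, and four beyond it. Note that the widest character in a string sets the width for the whole string.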