|
|
|
|
|
by lucb1e
781 days ago
|
|
> `str` are sequence of unicode codepoints [...] without reference to any encoding I guess I see it from the programmer's perspective: to handle bytes coming from the disk/network as a string, I need to specify an encoding, so they are (to me) byte sequences with an encoding assigned. Didn't realize strings don't have an encoding in Python's internal string handling but are, instead, something like an array of integers pointing to unicode code points. Not sure if this viewpoint means I am getting it backwards but I can see how that was phrased poorly on my part! |
|
1. Interface: How can I interact with "string" values, what kind of operations can I perform versus what can't be done ? Methods and Operators provided go here.
2. Representation: What is actually stored (in memory) ? Layout goes here.
So you may have understood (1) for Python, but you were badly off on (2). Now, at some level this doesn't matter, but, for performance obviously the choice of what you should do will depend on (2). Most obviously, if the language represents strings as UTF-8 bytes, then "encoding" a string as UTF-8 will be extremely cheap. Whereas, if the language represents them as UTF-16 code units, the UTF-8 encoding operation will be a little slower.