by n1ghtm4n
4723 days ago
> What happens when you take a slice of a string?
What happens when you slice a unicode string in Go is that it cuts multi-byte characters right in half, unless you get the byte boundaries just right. I know real programmers keep the byte boundaries for all the chars in all their strings in their head at all times, but for people like me this basically makes string slicing unusable for non-ASCII text. Python somehow magically slices unicode strings without chopping characters in half.
In Python:
  s = "нєℓℓσ"
  s[1:4] # "єℓℓ"
In Go:
  s[1:4] // "�є"
You need a byte offset to slice a string, and it's impossible to convert from a Unicode rune offset to a byte offset without parsing the entire string up until that point. I'm not all that familiar with Python, but if the language works as you implied, it is basically doing this behind the scenes in common string processing tasks:
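A minimal Go sketch of that scan, assuming a hypothetical helper `runeSlice` (not a standard library function, and not necessarily how any Python implementation does it): to honor rune offsets, it has to decode the string from the start until it reaches each offset.

```go
package main

import "fmt"

// runeSlice is a hypothetical helper for illustration. It returns the part of
// s between rune offsets start and end (assumes 0 <= start <= end).
// UTF-8 runes vary in width, so the only way to find the matching byte
// offsets is to walk the string from the beginning.
func runeSlice(s string, start, end int) string {
	// Default to len(s) so offsets past the last rune clamp to the end.
	byteStart, byteEnd := len(s), len(s)
	n := 0 // number of runes seen so far
	for i := range s { // i is the byte offset where each rune begins
		if n == start {
			byteStart = i
		}
		if n == end {
			byteEnd = i
			break
		}
		n++
	}
	return s[byteStart:byteEnd]
}

func main() {
	s := "нєℓℓσ"
	fmt.Println(runeSlice(s, 1, 4)) // rune offsets: єℓℓ
	fmt.Printf("%q\n", s[1:4])      // byte offsets: cuts the first rune in half
}
```

Every call costs a pass over the string up to `end`, which is the overhead being objected to.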
Sure, a Python implementation could statically optimize this, but... why should it have to in the first place? That's fucking stupid and should be considered a language bug when it could be doing this:
> I know real programmers keep the byte boundaries for all the chars in all their strings in their head at all times
When the hell would you have to remember the byte or rune boundaries for characters in the first place? Why would you be slicing up a string with magic number indices? If you're getting indices from pattern matching functions, you shouldn't care whether they're in bytes or bits or nibbles, you should just be passing them on to your language's split routines (or whatever else you wanted to do). Unless, of course, you're the one actually writing low-level string processing routines, in which case rune offsets are far less useful than byte offsets for the reason explained above.
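To illustrate the point about indices from pattern matching, here is a rough Go sketch. `afterMarker` is a hypothetical helper made up for this example; `strings.Index` is the real standard library call and returns a byte offset, which goes straight into the slice expression.

```go
package main

import (
	"fmt"
	"strings"
)

// afterMarker is a hypothetical helper for illustration. It returns the part
// of s that follows the first occurrence of marker, or "" if marker is absent.
func afterMarker(s, marker string) string {
	i := strings.Index(s, marker) // byte offset of the match, or -1
	if i < 0 {
		return ""
	}
	// i and len(marker) are both byte counts. Assuming s and marker are
	// valid UTF-8, the slice starts on a rune boundary with no rune counting.
	return s[i+len(marker):]
}

func main() {
	fmt.Println(afterMarker("нєℓℓσ, wørld", ", ")) // wørld
}
```

Nothing here needs to know how wide any character is; the offset is only ever passed from the search to the slice.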
This Python "feature" seems to exist entirely to keep newbies from getting confused when they attempt to slice up strings in their REPL, for I cannot fathom a reason why anyone would write "s[1:4]" in production code. IIRC, Python was designed for pedagogy, so I'm not surprised that it would take on such a pointless implementation cost just to spare teachers from explaining why "s[1:4] gave me question marks".