by ANTSANTS 4723 days ago
>Python somehow magically slices unicode strings without chopping characters in half.

You need a byte offset to slice a string, and it's impossible to convert from a Unicode rune offset to a byte offset without parsing the entire string up until that point. I'm not all that familiar with Python, but if the language works as you implied, it is basically doing this behind the scenes in common string processing tasks:

  1. The user uses some kind of pattern matching function or whatever to find where they want to split the string. Python returns a rune index.
  2. The user tells Python to go split apart that string along a rune index. It promptly begins parsing the string all over again until it finds the right byte boundaries.
  3. The language then actually creates the new string in between the byte boundaries.
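The three steps above can be sketched roughly as follows, assuming the string is held as valid UTF-8 bytes, as the comment supposes. The helper names `rune_to_byte_offset` and `slice_by_runes` are made up for illustration. This is a model of the cost being described, not CPython's implementation (CPython stores `str` in a fixed-width internal form).

```python
def rune_to_byte_offset(data: bytes, rune_index: int) -> int:
    """Return the byte offset of the rune_index-th code point in valid UTF-8 data.

    The scan always starts at byte 0, so the cost grows with rune_index.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes shaped like 0b10xxxxxx are continuation bytes;
        # any other byte starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == rune_index:
                return offset
            seen += 1
    if seen == rune_index:
        return len(data)  # one past the last code point
    raise IndexError("rune index out of range")


def slice_by_runes(data: bytes, start: int, stop: int) -> bytes:
    # Step 2: translate each rune index into a byte offset (two scans from the start).
    lo = rune_to_byte_offset(data, start)
    hi = rune_to_byte_offset(data, stop)
    # Step 3: copy out the bytes between the two boundaries.
    return data[lo:hi]


utf8 = "hєℓℓo".encode("utf-8")  # sample text, 5 code points in 10 bytes
middle = slice_by_runes(utf8, 1, 4)
print(middle.decode("utf-8"))
```

The slice itself is cheap; the work is in the two scans that turn rune indices back into byte offsets.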

Sure, a Python implementation could statically optimize this, but... why should it have to in the first place? That's fucking stupid and should be considered a language bug when it could be doing this:

  1. User pattern matches blah blah blah and gets a byte index.
  2. User tells their sane language to split the string apart at the byte index and it just does so.
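For comparison, the two-step byte-index flow might look like this, using Python's `bytes` type as a stand-in for a byte-indexed string. The sample data and the `=` / `;` delimiters are invented for the sketch. Because the search patterns are themselves whole UTF-8 characters and the data is valid UTF-8, the offsets returned by `find` always fall on character boundaries.

```python
data = "name=hєℓℓo;lang=go".encode("utf-8")  # hypothetical input

start = data.find(b"=") + 1    # byte index just past the first '='
stop = data.find(b";", start)  # byte index of the next ';'
value = data[start:stop]       # plain byte slice, no boundary rescan needed

print(value.decode("utf-8"))
```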

>I know real programmers keep the byte boundaries for all the chars in all their strings in their head at all times

When the hell would you have to remember the byte or rune boundaries for characters in the first place? Why would you be slicing up a string with magic number indices? If you're getting indices from pattern matching functions, you shouldn't care whether they're in bytes or bits or nibbles, you should just be passing them on to your language's split routines (or whatever else you wanted to do). Unless, of course, you're the one actually writing low-level string processing routines, in which case rune offsets are far less useful than byte offsets for the reason explained above.

This Python "feature" seems to exist entirely to keep newbies from getting confused when they attempt to slice up strings in their REPL, for I cannot fathom a reason why anyone would write "s[1:4]" in production code. IIRC, Python was designed for pedagogy, so I'm not surprised that it would take on such a pointless implementation cost just to spare teachers from explaining why "s[1:4] gave me question marks".

2 comments

You're right. You're not very familiar with Python. String slicing with numbered indices is used all the time. And you can slice more than just strings! It's one of the coolest features of Python and you're really missing out if your favorite language doesn't have that.

This might explain why my comments seem like heresy to you. I would point out that the OP is about Python programmers switching to Go.

>String slicing with numbered indices is used all the time.

Ok, so it (EDIT FOR YOUR BENEFIT: I'm talking about slicing strings with rune indices here, not slicing in general. Array slicing is a useful language feature, and, uh, it's not unique to Python or anything) is not just a language wart, but a fertile source of pointless inefficiency in everyday Python code, glad to know.

>This might explain why my comments seem like heresy to you.

You aren't challenging my beliefs or anything. I'm just trying to make you see that you don't understand very well how UTF-8 string operations work. If you did, you'd understand that Python is doing the exact same thing as Go here, but in a less efficient manner.

>Ok, so it's not just a language wart, but a fertile source of pointless inefficiency in everyday Python code, glad to know.

The fact that it's used all the time would suggest it's not pointless inefficiency, no? Maybe you should try Python before bashing it.

>Python is doing the exact same thing as Go here, but in a less efficient manner.

It's not doing the same thing. "�є" is not the same as "єℓℓ".
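A small sketch of the difference, using a sample string that is not taken from the thread. The same `[1:4]` is applied once to code points and once to the UTF-8 bytes; the byte version is decoded with `errors="replace"` so the damage shows up as U+FFFD instead of raising an exception.

```python
s = "hєℓℓo"  # sample text chosen for illustration

# Slice by code points, as Python's str does.
by_code_points = s[1:4]

# Slice the UTF-8 bytes with the same numbers, then decode leniently.
raw = s.encode("utf-8")[1:4]
by_bytes = raw.decode("utf-8", errors="replace")

print(by_code_points)
print(by_bytes)
```

Here the byte slice ends one byte into the first "ℓ", so the decoded result carries a replacement character where the code point slice has whole characters.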

> This Python "feature" seems to exist entirely to keep newbies from getting confused when they attempt to slice up strings in their REPL, for I cannot fathom a reason why anyone would write "s[1:4]" in production code.

How about dealing with a format whose data elements are defined as fixed-length in characters, and which happens to be encoded in Unicode?
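A minimal sketch of that case, with an invented record layout (5-character name, 3-character country code, 4-character year) and a hypothetical `parse_record` helper. It assumes each field character is a single code point; text with combining marks would need normalizing first.

```python
# Hypothetical layout: (field name, start, stop), measured in characters.
FIELDS = (("name", 0, 5), ("country", 5, 8), ("year", 8, 12))


def parse_record(line: str) -> dict:
    # Code point slicing keeps the columns aligned even when a
    # character takes more than one byte in UTF-8.
    return {key: line[start:stop].strip() for key, start, stop in FIELDS}


record = parse_record("SørenDNK1813")
print(record)
```

The sample line is 12 characters but 13 bytes in UTF-8, so the same offsets applied to the raw bytes would shift every field after the name by one.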