HN Mirror


	by skizm 4771 days ago
	I think the majority of the time you want the number of utf8 characters and not the byte count. In fact I have never wanted the byte count. If I did I would expect something like byteLen and Len. Not the other way around. You should be optimizing the common case, not the exception. Obviously I'm not a language designer so perhaps I'm talking out of my ass but I've heard this complaint A LOT.

3 comments

callenish 4771 days ago

Have you never had to indicate how many bytes you are sending over a stream, say in the Content-Length of an HTTP response? Have you never put strings into a byte buffer?
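
A minimal Go sketch of the Content-Length case, for illustration only. The `contentLength` helper is a made-up name and the sample string is the one used further down the thread. The header needs the byte count, which is what `len` gives you; the rune count comes from `utf8.RuneCountInString`.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// contentLength is a hypothetical helper: it returns the value an HTTP
// Content-Length header needs, which is the body's size in bytes.
func contentLength(body string) int {
	return len(body)
}

func main() {
	body := "нєℓℓσ"
	fmt.Println("Content-Length:", contentLength(body)) // 12 bytes
	fmt.Println("runes:", utf8.RuneCountInString(body)) // 5 code points
}
```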

But that doesn't matter. Let's say you are correct: when working with strings, you more often want the rune length. It still wouldn't be the right decision, given the other design decisions of Go, because it would have needlessly complicated things with only arguable benefits. Let me show you what I mean.

The len() function works with a whole lot of things: strings, arrays, slices, maps and channels. For the first three, len() returns the number of elements in the underlying storage, and for a string those elements are bytes. This is because all three are backed by an array, and so sensibly have similar semantics. It would have violated the principle of least surprise for anyone who knew the language to have one array-backed type not return its element count. Both the language developers and the users of it would have to special-case strings, in code and in their brains.
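
A small sketch of what len() reports for each of those types. The `lengths` helper and its sample values are invented for the example. Note that for arrays and slices the count is in elements, which matches the byte count only when the elements are one byte wide, as they are for a string.

```go
package main

import "fmt"

// lengths returns len() of a string, an array, a slice, a map and a channel.
func lengths() (str, arr, slc, mp, ch int) {
	s := "h\u00e9llo"           // 5 runes, but é takes 2 bytes in UTF-8
	a := [3]int64{1, 2, 3}      // 3 elements occupying 24 bytes
	sl := []int64{1, 2, 3, 4}   // 4 elements
	m := map[string]int{"a": 1} // 1 key
	c := make(chan int, 8)      // buffered channel
	c <- 1
	c <- 2 // 2 values queued
	return len(s), len(a), len(sl), len(m), len(c)
}

func main() {
	fmt.Println(lengths()) // 6 3 4 1 2
}
```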

Now, they could have decided to do it anyway, but then another surprise awaits. What happens when you take a slice of a string? Oh no, more special casing and more complication for everyone.

The Go developers do special-case where doing so would clearly be a win for their users. Consider range, which iterates by runes over a string, potentially moving the index on the underlying array forward by more than 1 on each pass. That is clearly going to be the most common use case for the user and so was worth doing. It also eliminates many of the use cases where getting the length of a string in runes would matter to you. Not all, but a lot.
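
To make the range behaviour concrete, here is a short sketch (`runeStarts` is a name made up for this example). The index that range hands back is the byte offset where each rune starts, so it jumps by 2 or 3 for multi-byte characters.

```go
package main

import "fmt"

// runeStarts collects the byte index that range reports for each rune in s.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "нєℓℓσ"
	for i, r := range s {
		fmt.Printf("byte %2d: %c\n", i, r)
	}
	fmt.Println(runeStarts(s)) // [0 2 4 7 10]
}
```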

n1ghtm4n 4770 days ago

> What happens when you take a slice of a string?

What happens when you slice a unicode string in Go is that it cuts multi-byte characters right in half, unless you get the byte boundaries just right. I know real programmers keep the byte boundaries for all the chars in all their strings in their head at all times, but for people like me this basically makes string slicing unusable for non-ASCII text.

Python somehow magically slices unicode strings without chopping characters in half.

In Python:

    s = "нєℓℓσ"
    s[1:4]  # "єℓℓ"

In Go:

    s[1:4] // "�є"
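
A fuller, runnable version of the Go side, as a sketch (`byteSlice` is a hypothetical helper). Bytes 1 through 3 are the second byte of н plus both bytes of є, so the result is not valid UTF-8; offsets 2 and 10 happen to land on rune boundaries.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteSlice slices s by byte offsets and reports whether the result is
// still valid UTF-8.
func byteSlice(s string, from, to int) (string, bool) {
	sub := s[from:to]
	return sub, utf8.ValidString(sub)
}

func main() {
	s := "нєℓℓσ"

	cut, valid := byteSlice(s, 1, 4) // starts in the middle of н
	fmt.Printf("%q valid=%v\n", cut, valid)

	whole, valid := byteSlice(s, 2, 10) // both offsets are rune boundaries
	fmt.Printf("%q valid=%v\n", whole, valid)
}
```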

ANTSANTS 4770 days ago

>Python somehow magically slices unicode strings without chopping characters in half.

You need a byte offset to slice a string, and it's impossible to convert from a Unicode rune offset to a byte offset without parsing the entire string up until that point. I'm not all that familiar with Python, but if the language works as you implied, it is basically doing this behind the scenes in common string processing tasks:

  1. The user uses some kind of pattern matching function or whatever to find where they want to split the string. Python returns a rune index.
  2. The user tells Python to go split apart that string along a rune index. It promptly begins parsing the string all over again until it finds the right byte boundaries.
  3. The language then actually creates the new string in between the byte boundaries.

Sure, a Python implementation could statically optimize this, but... why should it have to in the first place? That's fucking stupid and should be considered a language bug when it could be doing this:

  1. User pattern matches blah blah blah and gets a byte index.
  2. User tells their sane language to split the string apart at the byte index and it just does so.
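
The two workflows above can be sketched in Go, assuming the string is stored as UTF-8. Both helper names are invented for the example. `byteOffset` is the extra scan a rune index forces on you, while `splitAt` uses the byte index from `strings.Index` as-is.

```go
package main

import (
	"fmt"
	"strings"
)

// byteOffset converts a rune index into a byte offset. It has to walk the
// string from the start, which makes it O(n) in the length of s.
func byteOffset(s string, runeIndex int) int {
	n := 0
	for i := range s {
		if n == runeIndex {
			return i
		}
		n++
	}
	return len(s) // runeIndex is at or past the end
}

// splitAt cuts s around the first occurrence of sep, using the byte index
// from strings.Index directly.
func splitAt(s, sep string) (before, after string, found bool) {
	i := strings.Index(s, sep)
	if i < 0 {
		return s, "", false
	}
	return s[:i], s[i+len(sep):], true
}

func main() {
	s := "нєℓℓσ"
	fmt.Println(s[byteOffset(s, 1):byteOffset(s, 4)]) // rune-indexed slice, needs a scan

	key, value, _ := splitAt("имя=нєℓℓσ", "=")
	fmt.Println(key, value) // byte index used directly, no second scan
}
```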

>I know real programmers keep the byte boundaries for all the chars in all their strings in their head at all times

When the hell would you have to remember the byte or rune boundaries for characters in the first place? Why would you be slicing up a string with magic number indices? If you're getting indices from pattern matching functions, you shouldn't care whether they're in bytes or bits or nibbles, you should just be passing them on to your language's split routines (or whatever else you wanted to do). Unless, of course, you're the one actually writing low-level string processing routines, in which case rune offsets are far less useful than byte offsets for the reason explained above.

This Python "feature" seems to exist entirely to keep newbies from getting confused when they attempt to slice up strings in their REPL, for I cannot fathom a reason why anyone would write "s[1:4]" in production code. IIRC, Python was designed for pedagogy, so I'm not surprised that it would take on such a pointless implementation cost just to spare teachers from explaining why "s[1:4] gave me question marks"

n1ghtm4n 4770 days ago

You're right. You're not very familiar with Python. String slicing with numbered indices is used all the time. And you can slice more than just strings! It's one of the coolest features of Python and you're really missing out if your favorite language doesn't have that.

This might explain why my comments seem like heresy to you. I would point out that the OP is about Python programmers switching to Go.

ANTSANTS 4770 days ago

>String slicing with numbered indices is used all the time.

Ok, so it (EDIT FOR YOUR BENEFIT: I'm talking about slicing strings with rune indices here, not slicing in general. Array slicing is a useful language feature, and, uh, it's not unique to Python or anything) is not just a language wart, but a fertile source of pointless inefficiency in everyday Python code, glad to know.

>This might explain why my comments seem like heresy to you.

You aren't challenging my beliefs or anything, I'm just trying to make you see that you don't understand how UTF-8 string operations work very well. If you did, you'd understand that Python is doing the exact same thing as Go here, but in a less efficient manner.

n1ghtm4n 4770 days ago

>Ok, so it's not just a language wart, but a fertile source of pointless inefficiency in everyday Python code, glad to know.

The fact that it's used all the time would suggest it's not pointless inefficiency, no? Maybe you should try Python before bashing it.

>Python is doing the exact same thing as Go here, but in a less efficient manner.

It's not doing the same thing. "�є" is not the same as "єℓℓ".

dragonwriter 4770 days ago

> This Python "feature" seems to exist entirely to keep newbies from getting confused when they attempt to slice up strings in their REPL, for I cannot fathom a reason why anyone would write "s[1:4]" in production code.

Dealing with a format where data elements are defined to be fixed length in characters that happens to be encoded in Unicode?

neild 4770 days ago

> Python somehow magically slices unicode strings without chopping characters in half.

Well, that rather depends on what you mean by "character" and "in half".

    >>> s = u're\u0301sume\u0301'
    >>> print s
    résumé
    >>> len(s)
    8
    >>> for i in xrange(len(s)):
    ...     print s[i]
    ... 
    r
    e
    
    s
    u
    m
    e

    >>> print ' '.join(s)
    r e ́ s u m e ́

skizm 4770 days ago

> Have you never had to indicate how many bytes you are sending over a stream...

I have and I generally like the syntax to be more explicit (since I so rarely work in bytes): string.getBytes().length

You make all fair points, and I guess it is my opinion that varies then but I think the rule of least surprise would involve returning the rune count for both strings and string slices. Also, wow, range iterates over runes but len returns byte count. That's messed up.

lagom 4771 days ago

Aside from API, there's a convincing performance argument to have len() return the byte count rather than number of utf8 characters.

The implementation of strings in Go is a 2-word struct containing a pointer to the start of the string and the length (in bytes). Under this implementation, len(s) is O(1) and RuneCountInString(s) is O(n). It makes sense to have the default case also be the fast one, particularly since people appreciate Go for its performance.
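
A quick sketch of that layout and the two costs. `headerWords` is a made-up helper, and the two-word size is a detail of the standard Go toolchain rather than something the language spec promises.

```go
package main

import (
	"fmt"
	"unicode/utf8"
	"unsafe"
)

// headerWords reports how many machine words a string value occupies.
func headerWords() uintptr {
	var s string
	return unsafe.Sizeof(s) / unsafe.Sizeof(uintptr(0))
}

func main() {
	s := "нєℓℓσ"
	fmt.Println(headerWords())             // 2 with the standard toolchain: pointer + length
	fmt.Println(len(s))                    // 12, read straight from the header
	fmt.Println(utf8.RuneCountInString(s)) // 5, found by scanning all 12 bytes
}
```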

Alternatively, you could store the rune-count in the 2-word struct to reverse the above runtimes. However, this is detrimental for the common operation of converting between []byte/string as well as writing a string to a buffer. Both of those operations are a simple memcpy with the actual Go implementation, but would be O(n) using this alternate implementation.

Perhaps you could make it a 3-word struct that contains both byte-length and rune-length; then all strings take up additional memory and carry more overhead when passed as function arguments.

n1ghtm4n 4770 days ago

Why not a 2-word struct with the rune count instead of the byte count? There's zero performance cost for many strings because the rune count is known at compile time. For the rest, most strings are too short for Big-O analysis to be relevant and I would guess (enlighten me if I'm wrong) that the cost of computing the bounds of each character is negligible on a modern processor. Multi-byte chars in a string are going to be adjacent in memory, adjacent in cache, and therefore trivial for today's not-at-all-instruction-bound CPUs. Again, correct me if I'm wrong.

neild 4770 days ago

It is very seldom that you really want to deal with a string as an array of runes. (If you actually do want to, Go makes it fairly easy: just use []rune rather than string.)

Consider a simple string: "école". How many runes does it contain? Possibly five:

    LATIN SMALL LETTER E WITH ACUTE
    LATIN SMALL LETTER C
    LATIN SMALL LETTER O
    LATIN SMALL LETTER L
    LATIN SMALL LETTER E

Possibly six:

    LATIN SMALL LETTER E
    COMBINING ACUTE ACCENT
    LATIN SMALL LETTER C
    LATIN SMALL LETTER O
    LATIN SMALL LETTER L
    LATIN SMALL LETTER E

If you normalize the string you can guarantee you have the first form, but not every glyph can be represented as a single rune.
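
The two spellings of "école" can be checked with a few lines of Go. This is only a sketch: `counts` is a name invented here, and the strings are written with escapes so the two forms are unambiguous.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the byte length and rune count of s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	composed := "\u00e9cole"    // é as one code point
	decomposed := "e\u0301cole" // e followed by COMBINING ACUTE ACCENT

	fmt.Println(counts(composed))       // 6 5
	fmt.Println(counts(decomposed))     // 7 6
	fmt.Println(composed == decomposed) // false: comparison is byte-wise
}
```

Both strings render the same way, yet neither the byte count nor the rune count agrees between them.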

Fortunately, you generally don't need to deal with any of this. If you're working with filenames, for example, you really only care about the path separator ('/' or '\' or whatever); everything else is just a bunch of opaque data. You can write a perfectly valid function to split a filename into components without understanding anything about combining characters. When you're dealing with data in this fashion, you rarely if ever care about the number of runes in a string; instead you care about the position of specific runes.
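
A minimal sketch of the filename case, assuming '/' as the separator (`components` is a hypothetical helper). It works without any knowledge of combining characters because every byte of a multi-byte UTF-8 sequence has its high bit set, so the '/' byte can never appear inside one.

```go
package main

import (
	"fmt"
	"strings"
)

// components splits a slash-separated path. It only looks for the '/' byte;
// everything between separators is passed through untouched.
func components(path string) []string {
	return strings.Split(path, "/")
}

func main() {
	path := "home/re\u0301sume\u0301/\u00e9cole.txt"
	for _, c := range components(path) {
		fmt.Printf("%q (%d bytes)\n", c, len(c))
	}
}
```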

n1ghtm4n 4770 days ago

Thank you for the explanation! Converting to a rune slice and back does give me the behavior that I wanted. It still looks butt ugly to me, but at least it works.

In Go:

    fmt.Printf("%s", string([]rune("нєℓℓσ")[1:4]))
    // єℓℓ

In Python:

    print("нєℓℓσ"[1:4])
    # єℓℓ

Mr_T_ 4770 days ago

You almost always need the byte count for protocols, buffers and stuff. For user-interfacing use-cases (e.g. editors) you probably want the glyph or grapheme count, not the rune count. Please remember, runes / code points, graphemes and glyphs are different things:

http://www.icu-project.org/docs/papers/forms_of_unicode/
