Hacker News new | ask | show | jobs
by lagom 4723 days ago
Aside from API, there's a convincing performance argument to have len() return the byte count rather than number of utf8 characters.

The implementation of strings in Go is a 2-word struct containing a pointer to the start of the string and the length (in bytes). Under this implementation, len(s) is O(1) and RuneCountInString(s) is O(n). It makes sense to have the default case also be the fast one, particularly since people appreciate Go for its performance.

Alternatively, you could store the rune-count in the 2-word struct to reverse the above runtimes. However, this is detrimental for the common operation of converting between []byte/string as well as writing a string to a buffer. Both of those operations are a simple memcpy with the actual Go implementation, but would be O(n) using this alternate implementation.

Perhaps you could make it a 3-word struct that contains both byte-length and rune-length; Then all strings take up additional memory as well as requiring more overhead when used as function arguments.

1 comments

Why not a 2-word struct with the rune count instead of the byte count? There's zero performance cost for many strings because the rune count is known at compile time. For the rest, most strings are too short for Big-O analysis to be relevant and I would guess (enlighten me if I'm wrong) that the cost of computing the bounds of each character is negligible on a modern processor. Multi-byte chars in a string are going to be adjacent in memory, adjacent in cache, and therefore trivial for today's not-at-all-instruction-bound CPUs. Again, correct me if I'm wrong.
It is very seldom that you really want to deal with a string as an array of runes. (If actually you do want to, Go makes it fairly easy: Just use []rune rather than string.)

Consider a simple string: "école". How many runes does it contain? Possibly five:

    LATIN SMALL LETTER E WITH ACUTE
    LATIN SMALL LETTER C
    LATIN SMALL LETTER O
    LATIN SMALL LETTER L
    LATIN SMALL LEtTER E
Possibly six:

    LATIN SMALL LETTER E
    COMBINING ACUTE ACCENT
    LATIN SMALL LETTER C
    LATIN SMALL LETTER O
    LATIN SMALL LETTER L
    LATIN SMALL LEtTER E
If you normalize the string you can guarantee you have the first form, but not every glyph can be represented as a single rune.

Fortunately, you generally don't need to deal with any of this. If you're working with filenames, for example, you really only care about the path separator ('/' or '\' or whatever); everything else is just a bunch of opaque data. You can write a perfectly valid function to split a filename into components without understanding anything about combining characters. When you're dealing with data in this fashion, you rarely if ever care about the number of runes in a string; instead you care about the position of specific runes.

Thank you for the explanation! Converting to a rune slice and back does give me the behavior that I wanted. It still looks butt ugly to me, but at least it works.

In Go:

    fmt.Printf("%s", string([]rune("нєℓℓσ")[1:4]))
    // єℓℓ
In Python:

    print("нєℓℓσ"[1:4])
    # єℓℓ