by millstone 3206 days ago
In practice UTF-8 has done more to enable the wrong thing than to force programmers to do the right thing.

> You can't really index Unicode characters like ASCII strings

But then why do strings-are-UTF8 languages like Go or D make it so easy? Why optimize your syntax with `len(txt)` and `txt[0]` when, as you point out, you can't do that? Why make it trivial to split code points or composed character sequences, but make something like proper string truncation brutally hard?
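A minimal Go sketch of the complaint, assuming a string with one precomposed accented letter. `truncateBytes` and `truncateRunes` are hypothetical helper names, not standard library functions, and the rune-based version is still not grapheme-aware, so it can separate a base letter from a following combining mark.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBytes (hypothetical helper) keeps at most n bytes, which is what
// plain slicing s[:n] does. It can cut a multi-byte code point in half.
func truncateBytes(s string, n int) string {
	if n >= len(s) {
		return s
	}
	return s[:n]
}

// truncateRunes (hypothetical helper) keeps at most n code points and never
// splits one. It does not know about grapheme clusters.
func truncateRunes(s string, n int) string {
	count := 0
	for i := range s { // i is the byte offset where each code point starts
		if count == n {
			return s[:i]
		}
		count++
	}
	return s
}

func main() {
	txt := "h\u00e9llo" // "héllo"; é (U+00E9) takes two bytes in UTF-8

	fmt.Println(len(txt))                    // 6: bytes, not characters
	fmt.Println(utf8.RuneCountInString(txt)) // 5: code points
	fmt.Printf("%x\n", txt[1])               // c3: first byte of é, not é itself

	fmt.Println(utf8.ValidString(truncateBytes(txt, 2))) // false: é was split
	fmt.Printf("%q\n", truncateRunes(txt, 2))            // "hé"
}
```

The short syntax (`len`, indexing, slicing) is the byte-level one; the code-point-aware version takes a loop, and a grapheme-aware one would need a segmentation library on top.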

UTF-8's fail-fast property has not enabled more Unicode-savviness. Instead it just lets programmers pretend that we are still in the land of C strings.

2 comments

> Why optimize your syntax with `len(txt)` and `txt[0]` when, as you point out, you can't do that?

I like Rust's approach. It's a strings-are-UTF8 language but strings (both str and String):

- are not directly indexable

- force you to be explicit when iterating: you iterate over either `s.chars()` or `s.bytes()`

- are called out in the docs as being a vector of unsigned 8-bit integers internally

- support a len() method that is called out as returning the length of that vector

- can be sliced if you reaaaally need to get around the inability to index directly, but attempting to slice in the middle of a character causes a panic

> - support a len() method that is called out as returning the length of that vector

They should have called that one bytelen() then.

And how do you get a proper offset for slicing? Do you then have to interpret the UTF-8 bytes yourself, or can you somehow get it via the chars() iterator or something similar?

Yeah this is the way to go for sure.

> But then why do strings-are-UTF8 languages like Go

To clarify: strings in Go are not necessarily UTF-8. String literals will be, because the source code is defined to be UTF-8, but string values in Go can contain any sequence of bytes: https://blog.golang.org/strings

Note that this prints 2, because the character contains two bytes in UTF-8, even though the two bytes correspond to one codepoint: https://play.golang.org/p/BqGzW1O2WX

Go also has the concept of a rune, which is separate from a byte and a string, and makes this easier when you're working with raw string encodings.
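The linked playground snippet is not reproduced here, so the following is only a sketch of the same idea, assuming the character is é (U+00E9). `counts` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts (hypothetical helper) returns the size of s in bytes and in runes.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "\u00e9" // é, encoded in UTF-8 as the two bytes 0xC3 0xA9

	nBytes, nRunes := counts(s)
	fmt.Println(nBytes) // 2
	fmt.Println(nRunes) // 1

	r := []rune(s)           // decode the string into code points
	fmt.Printf("%U\n", r[0]) // U+00E9
}
```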

This makes it sound like Go is even more confused. If strings in Go are not necessarily UTF-8, why do the strings package, `for range`, etc. assume UTF-8?

> If strings in Go are not necessarily UTF-8, why do the strings package, `for range`, etc. assume UTF-8?

The blog post I linked to explains this in more detail, but in short: the `strings` package provides essentially the same functions as the `bytes` package does, except applied to work on UTF-8 strings. There are other packages for dealing with other text encodings.
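As a rough illustration of that mirroring, here is a sketch that runs the same lookup through both packages. `indexBoth` is a hypothetical helper; `strings.IndexRune`, `bytes.IndexRune`, `strings.ToUpper` and `bytes.ToUpper` are the standard library functions.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// indexBoth (hypothetical helper) looks up the rune r in the same text,
// once as a string and once as a []byte.
func indexBoth(text string, r rune) (inString, inBytes int) {
	return strings.IndexRune(text, r), bytes.IndexRune([]byte(text), r)
}

func main() {
	s, b := indexBoth("caf\u00e9!", '!')
	fmt.Println(s, b) // 5 5: both are byte offsets, and é occupies bytes 3 and 4

	// Both packages interpret their input as UTF-8 when changing case.
	fmt.Println(strings.ToUpper("caf\u00e9"))               // CAFÉ
	fmt.Println(string(bytes.ToUpper([]byte("caf\u00e9")))) // CAFÉ
}
```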

The `for range` syntax is the one "special case", and it was done because the alternative (having it range over bytes instead of codepoints) is almost never desirable in practice[0], and it's easier to manually iterate the few times you do need it than it would be to import a UTF-8 package just to iterate over a string 99.9% of the time.

[0] iterating over bytes is done all the time, of course, but usually at that point you're dealing with an actual slice of bytes already that you want to iterate over, not a string.
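A small sketch of the two iteration styles, with hypothetical helper names `rangeRunes` and `rawBytes`. It also shows what `for range` does with bytes that are not valid UTF-8: it keeps going and yields U+FFFD for each bad byte.

```go
package main

import "fmt"

// rangeRunes (hypothetical helper) collects what `for range` yields for s:
// one rune per decoded code point.
func rangeRunes(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

// rawBytes (hypothetical helper) walks s byte by byte with an index loop.
func rawBytes(s string) []byte {
	var out []byte
	for i := 0; i < len(s); i++ {
		out = append(out, s[i])
	}
	return out
}

func main() {
	s := "a\u00e9" // "aé": 2 code points, 3 bytes

	fmt.Printf("%U\n", rangeRunes(s)) // [U+0061 U+00E9]
	fmt.Printf("% x\n", rawBytes(s))  // 61 c3 a9

	// Invalid UTF-8 does not stop the loop; the bad byte decodes to U+FFFD.
	fmt.Printf("%U\n", rangeRunes("\xff")) // [U+FFFD]
}
```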

The point is that Go lumps together byte arrays and strings. It's a common flaw, but it's really unfortunate to see it perpetrated in a language that was designed after this lesson was already learned.

A byte array is a representation of a string, for sure. But strings themselves are higher-level abstractions. It shouldn't be that easy to mix the two.

An equivalent situation would be if integers were byte arrays. So len(x) would give you 4, for example, and you could do x[0], x[1] etc - except you would almost never actually do that in practice, and occasionally you'd end up doing the wrong thing by mistake.

If any language actually worked that way, everyone would be up in arms about it. Unfortunately, the same passes for strings, because of how conditioned we are to treat them as byte sequences.

Calling it "char" in C was probably the second million dollar mistake in the history of PL design, right after null.

Easily moving from bytes to strings and back is the only way it makes sense for Go. It runs on POSIX for the most part, and every. single. POSIX. API. is done in bytes. Not Unicode. Bytes.
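For what it's worth, a minimal sketch of that round trip, assuming a made-up filename that contains a byte (0xFF) which can never appear in valid UTF-8. `roundTrip` is a hypothetical helper; no real filesystem call is involved.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// roundTrip (hypothetical helper) converts raw bytes to a string and back,
// as would happen on either side of a byte-oriented API.
func roundTrip(raw []byte) []byte {
	name := string(raw) // no validation and no transcoding happens here
	return []byte(name)
}

func main() {
	// A made-up filename; 0xFF is never valid in UTF-8.
	raw := []byte{'f', 'o', 'o', 0xff, '.', 't', 'x', 't'}

	fmt.Println(utf8.Valid(raw))                  // false
	fmt.Println(bytes.Equal(raw, roundTrip(raw))) // true: nothing was lost
}
```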

Languages like Python 3 that try to be so Unicode-pure that they crash or ignore legal Linux filenames are insane.

I would dare say that the fact that Linux filenames don't have to be valid strings (i.e. they can be arbitrary byte sequences that cannot be meaningfully interpreted using the current locale encoding) is the insane part.

But does POSIX require support for arbitrary byte sequences in filenames, or does it merely use bytes (in locale encoding) as part of its ABI? I suspect the latter, since OS X is Unix-certified, and IIRC it does use UTF-16 for filenames on HFS+ - so presumably their POSIX API implementation maps to that somehow. If that's correct, then that's also the sane way forward - for the sake of POSIX compatibility, use byte arrays to pass strings around, but for the sake of sanity, require them to be valid UTF-8.