HN Mirror


by int_19h 3209 days ago

The point is that Go lumps together byte arrays and strings. It's a common flaw, but it's really unfortunate to see it perpetuated in a language that was designed after this lesson was already learned.

A byte array is a representation of a string, for sure. But strings themselves are higher-level abstractions. It shouldn't be that easy to mix the two.
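To make the mixing concrete, here is a minimal Go sketch of how byte-level and character-level views of the same string diverge. The helper name `byteAndRuneLen` is made up for illustration; `len`, indexing, `range`, and `utf8.RuneCountInString` are standard Go.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen (hypothetical helper) returns the byte length of s
// and the number of Unicode code points in s.
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "h\u00e9llo" // U+00E9 (e with acute) takes two bytes in UTF-8

	b, r := byteAndRuneLen(s)
	fmt.Println(b, r) // 6 5

	// Indexing yields a byte, not a character.
	fmt.Println(s[1]) // 195, the first byte of the two-byte sequence

	// range decodes UTF-8 and yields code points with their byte offsets.
	for i, c := range s {
		fmt.Printf("%d:%c ", i, c)
	}
	fmt.Println()
}
```

Nothing in the type system distinguishes the byte view from the text view here; which one you get depends on whether you index or range.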

An equivalent situation would be if integers were byte arrays. So len(x) would give you 4 for a 32-bit integer, for example, and you could do x[0], x[1], etc., except you would almost never actually do that in practice, and occasionally you'd end up doing the wrong thing by mistake.

If any language actually worked that way, everyone would be up in arms about it. Unfortunately, the same passes for strings, because of how conditioned we are to treat them as byte sequences.

Calling it "char" in C was probably the second billion-dollar mistake in the history of PL design, right after null.

1 comments

zlynx 3209 days ago

Easily moving from bytes to strings and back is the only way it makes sense for Go. It runs on POSIX for the most part, and every. single. POSIX. API. is done in bytes. Not Unicode. Bytes.
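A rough sketch of what that looks like in Go, assuming nothing beyond the standard conversions: `string(b)` and `[]byte(s)` copy the bytes without validating or transcoding them, so even a sequence that is not valid UTF-8 survives the round trip. The function name `roundTrip` is invented for the example.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrip (hypothetical helper) converts raw bytes to a string and back.
// Neither conversion validates or alters the bytes; each just copies them.
func roundTrip(raw []byte) []byte {
	s := string(raw)
	return []byte(s)
}

func main() {
	// 0xff can never appear in valid UTF-8, but it is a legal byte
	// in a Linux filename.
	raw := []byte{'a', 0xff, 'b'}

	fmt.Println(bytes.Equal(raw, roundTrip(raw))) // true
	fmt.Printf("%q\n", string(raw))               // "a\xffb"
}
```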

Languages like Python 3 that try to be so Unicode-pure that they crash or ignore legal Linux filenames are insane.

int_19h 3208 days ago

I would dare say that the fact that Linux filenames don't have to be valid strings (i.e. they can be arbitrary byte sequences that cannot be meaningfully interpreted using the current locale encoding) is the insane part.

But does POSIX require support for arbitrary byte sequences in filenames, or does it merely use bytes (in locale encoding) as part of its ABI? I suspect the latter, since OS X is Unix-certified, and IIRC it does use UTF-16 for filenames on HFS+, so presumably their POSIX API implementation maps to that somehow. If that's correct, then that's also the sane way forward: for the sake of POSIX compatibility, use byte arrays to pass strings around, but for the sake of sanity, require them to be valid UTF-8.
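As a sketch of that last idea (bytes at the API boundary, valid UTF-8 required before they are treated as text), something like the following would do in Go. `checkName` and `errInvalidName` are hypothetical names, not part of any real API; `utf8.Valid` is the standard library call.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidName is a hypothetical error value for this sketch.
var errInvalidName = errors.New("filename is not valid UTF-8")

// checkName (hypothetical helper) accepts a filename as raw bytes, the way
// a POSIX call would deliver it, and only returns a string if the bytes
// are well-formed UTF-8.
func checkName(name []byte) (string, error) {
	if !utf8.Valid(name) {
		return "", errInvalidName
	}
	return string(name), nil
}

func main() {
	names := [][]byte{
		[]byte("r\u00e9sum\u00e9.txt"), // well-formed UTF-8
		{'a', 0xff, 'b'},               // not UTF-8
	}
	for _, name := range names {
		s, err := checkName(name)
		fmt.Printf("%q %v\n", s, err)
	}
}
```

Whether rejecting the name is acceptable depends on whether the filesystem can already contain such names, which is exactly the disagreement above.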