by int_19h 3403 days ago
> The general consensus on how to handle this correctly is to make everything a bytestring and assume the encoding as being UTF8 by default.

Whose "general consensus"? It's certainly not the approach adopted by the vast majority of mainstream general-purpose programming languages out there.

You mean most of the mainstream, general-purpose languages in use today, which were initially designed 20-50 years ago? And those which don't actively look for opportunities to shoot themselves in the face with changes that break the language? Then no, not that consensus.

But if you were to ask any sensible designer of a new language with skyrocketing, runaway popularity, such as Rob Pike, they'll tell you differently. Or even Guido van Rossum, if he were to be honest with you. While Pike's colleagues at Bell Labs, like Dennis Ritchie, may not have designed C this way for obvious reasons, Pike and Ken Thompson (another Bell Labs veteran) did design Go that way.

So now it's the consensus of "sensible designers of new languages". Where "sensible" is very subjective, and I have a feeling that your definition of it would basically presuppose agreeing with your conceptual view of strings, begging the question.

Aside from Go, can you give any other examples? Swift is obviously in the "new language" category (more so than Go, anyway), and yet it didn't go down that path.

Well, do your research and come to your own conclusions. Most people are going to agree that UTF8 is the way to go. You can advocate something else, since you seem to take offense at my opposition to Python3's Microsoft-oriented implementation.

If you know anything about Swift, it was designed with the primary goal of being a smooth transition from, and interoperating with, ObjC. So, like other legacy implementations (such as CPython3), it made sacrifices that limited how forward-looking it could be.

I'm not at all opposed to UTF-8 as internal encoding for strings. But that's completely different and orthogonal to what you're talking about, which is whether strings and byte arrays should be one and the same thing semantically, and represented by the same type, or by compatible types that are used interchangeably.

I want my strings to be strings - that is, an object for which it is guaranteed that enumerating codepoints always succeeds (and, consequently, any operation that requires valid codepoints, e.g. case-insensitive comparison, also succeeds). This is not the case for "just assume a bytearray is UTF-8 if used in string context", which is why I consider this model broken on the abstraction level. It's kinda similar to languages that don't distinguish between strings and numbers, and interpret any numeric-looking string as a number in the appropriate context.
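
A minimal Python sketch of that guarantee, for illustration only. The helper names `codepoints` and `codepoints_assuming_utf8` are hypothetical, and the "assume UTF-8" model is approximated here by decoding at the point of use:

```python
def codepoints(text: str) -> list:
    """Enumerate code points of a real string; there is no failure path."""
    return [ord(ch) for ch in text]


def codepoints_assuming_utf8(raw: bytes) -> list:
    """Same operation on bytes that are merely assumed to be UTF-8.

    The decode step can raise, so every caller inherits a failure mode.
    """
    return [ord(ch) for ch in raw.decode("utf-8")]


good = "Straße"
bad = b"Stra\xdfe"  # Latin-1 encoded text, not valid UTF-8

print(codepoints(good))
# Case-insensitive comparison needs valid code points to work on.
print(good.casefold() == "STRASSE".casefold())

try:
    codepoints_assuming_utf8(bad)
except UnicodeDecodeError as exc:
    # The error surfaces wherever the bytes first get used as text.
    print("failed at point of use:", exc.reason)
```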

FWIW, the string implementation in Python 3 supports UTF-8 representation. And it could probably be changed to use that as the primary canonical representation, only generating other ones if and when they're requested.

A default UTF8 string type has to be allowed to be used interchangeably with bytestrings, since ASCII is a valid subset. Your string type shouldn't be spellchecking or checking for complete sentences either. What comes in, comes in. Validate it elsewhere.

Thus Go's strings don't fail your desire for a guarantee any more than anything else that assumes UTF8 would. They're Unicode by default, which was the whole point of Python3, but Go has it too, in a more elegant way. That's the beauty of UTF8 by default: you can pass it valid UTF8 or ASCII, since ASCII is a subset, and the context in which it's received is up to you. If you're expecting bytes, it works; if you're expecting Unicode codepoints, that works too. There's no reason to get your hands dirty with encodings unless you need to decode UTF16 etc. first. If there is still a concern about data validation, it's up to you, not your string type, to throw an exception.

> A default UTF8 string-type has to be allowed to be used interchangeably with bytestrings since ASCII is a valid subset.

This only works in one direction. Sure, any valid UTF-8 is a bytestring. But not every bytestring is valid UTF-8. "Use interchangeably" implies the ability to substitute in both directions, and passing an arbitrary bytestring where a string is expected violates LSP (the Liskov substitution principle).
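
The asymmetry is easy to check. Here is a rough Python sketch; `is_valid_utf8` is a made-up helper for this example, not a standard library function:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 under strict error handling."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Direction 1: ordinary text always encodes to valid UTF-8 bytes.
for text in ["ascii", "naïve", "日本語"]:
    assert is_valid_utf8(text.encode("utf-8"))

# Direction 2: arbitrary bytes are not always valid UTF-8.
samples = [
    b"\xff",          # byte that never appears in UTF-8
    b"\xc3",          # lead byte with its continuation byte missing
    b"\xed\xa0\x80",  # encoded surrogate, rejected by Python's codec
]
for raw in samples:
    print(raw, is_valid_utf8(raw))
```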

> What comes in, comes in. Validate it elsewhere.

I have a problem with that. It's explicitly against the fail-fast design philosophy, whereby invalid input should be detected and acted upon as early as possible. First, because failing early helps identify the true origin of the error. And second, because there's a category of subtle bugs where invalid input can be combined or processed in ways that make it valid-but-nonsensical, and as a result there are no reported errors at all, just quietly incorrect behavior.
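
One way such a valid-but-nonsensical result can arise, sketched in Python under the assumption that two fields arrive Latin-1 encoded while the program expects UTF-8. The names `field_a`, `field_b` and `accept` are hypothetical:

```python
def accept(raw: bytes) -> str:
    """Fail-fast boundary: reject invalid UTF-8 as soon as it arrives."""
    return raw.decode("utf-8")  # strict error handling is the default


# Two separate inputs, each Latin-1 encoded, each invalid as UTF-8.
field_a = "Ã".encode("latin-1")  # b"\xc3"
field_b = "©".encode("latin-1")  # b"\xa9"

# "Validate it elsewhere": the bytes get concatenated first and decoded
# later. Together they happen to form valid UTF-8, so nothing complains.
joined = field_a + field_b
print(joined.decode("utf-8"))  # prints "é", which neither input contained

# Fail-fast: each input is checked at the boundary, so the bad data is
# reported next to its true origin.
for name, raw in [("field_a", field_a), ("field_b", field_b)]:
    try:
        accept(raw)
    except UnicodeDecodeError as exc:
        print(name, "rejected:", exc.reason)
```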

Any language that has Unicode strings can handle ASCII just fine, since ASCII is a strict subset of UTF-8 - that doesn't require the encoding of the strings to be UTF-8. For languages that use a different encoding, it would mean that ASCII gets re-encoded into whatever the language uses, but this is largely an implementation detail.
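
A small Python illustration of that point. UTF-16 stands in here for "whatever the language uses"; that choice is an assumption made for the example, not a claim about any particular runtime:

```python
ascii_bytes = b"hello"

# ASCII input decodes to the same text whichever codec reads it,
# because ASCII is a subset of UTF-8.
text = ascii_bytes.decode("ascii")
assert text == ascii_bytes.decode("utf-8")

# In UTF-8 the bytes are unchanged.
assert text.encode("utf-8") == ascii_bytes

# A runtime that stored strings as UTF-16 would re-encode them, but the
# text, as a sequence of code points, is the same either way.
utf16 = text.encode("utf-16-le")
print(utf16)
assert utf16.decode("utf-16-le") == text
```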

Of course, if you're reading a source that is not necessarily in any Unicode encoding (UTF-8 or otherwise), and that may be non-string data, and you just need to pass the data through - well then, that's exactly what bytestrings are there for. The fact that you cannot easily mix them with strings (again, even if they're UTF-8-encoded) is a benefit in this case, because such mixing only makes sense if the bytestring is itself properly encoded. If it's not, you just silently get garbage. Using two different types at the input/output boundary makes it clear what assumptions can be made about every particular bit of input.
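
As a rough sketch of the two-type boundary in Python (the function names `passthrough` and `read_text` are invented for this example):

```python
def passthrough(chunk: bytes) -> bytes:
    """Opaque data: no assumption about encoding, never treated as text."""
    return chunk


def read_text(chunk: bytes, encoding: str = "utf-8") -> str:
    """Text input: decoded and validated once, at the boundary."""
    return chunk.decode(encoding)


blob = b"\xff\xfe\x00\x01"            # arbitrary binary data
label = read_text(b"caf\xc3\xa9")     # valid UTF-8, becomes a str

# The opaque path carries any bytes through untouched.
assert passthrough(blob) == blob

# Mixing the two types is a type error, not silent garbage.
try:
    label + blob
except TypeError as exc:
    print("refused to mix:", exc)

# Mixing has to be explicit: decode the bytes (which validates them)
# or encode the string before combining.
print(label.encode("utf-8") + blob)
```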