Hacker News new | ask | show | jobs
by int_19h 3403 days ago
I'm not at all opposed to UTF-8 as internal encoding for strings. But that's completely different and orthogonal to what you're talking about, which is whether strings and byte arrays should be one and the same thing semantically, and represented by the same type, or by compatible types that are used interchangeably.

I want my strings to be strings - that is, an object for which it is guaranteed that enumerating codepoints always succeeds (and, consequently, any operation that requires valid codepoints, like e.g. case-insensitive comparison, also succeeds). This is not the case for "just assume a bytearray is UTF-8 if used in string context", which is why I consider this model broken on the abstraction level. It's kinda similar to languages that don't distinguish between strings and numbers, and interpret any numeric-looking string as number in the appropriate context.

FWIW, the string implementation in Python 3 supports UTF-8 representation. And it could probably be changed to use that as the primary canonical representation, only generating other ones if and when they're requested.

1 comments

A default UTF8 string-type has to be allowed to be used interchangeably with bytestrings since ASCII is a valid subset. Your string type shouldn't be spellchecking nor checking for complete sentences either. What comes in, comes in. Validate it elsewhere.

Thus Go's strings don't potentially fail your desire for a guarantee anymore than anything else would assuming UTF8. They're unicode-by-default, which was the whole point to Python3 but Go has it too in a more elegant way. That's the beauty to UTF8 by default, you can pass it valid UTF8 or ASCII since it's a subset, the context of which it's being received is up to you. If you're expecting bytes it works, if you're expecting unicode codepoints that works. There's no reason to get your hands dirty with encodings unless you need to decode UTF16 etc first. If there is still a concern about data validation, that's up to you not your string type to throw an exception.

> A default UTF8 string-type has to be allowed to be used interchangeably with bytestrings since ASCII is a valid subset.

This only works in one direction. Sure, any valid UTF-8 is a bytestring. But not every bytestring is valid UTF-8. "Use interchangeably" implies the ability to substitute in both directions, which violates LSP.

> What comes in, comes in. Validate it elsewhere.

I have a problem with that. It's explicitly against the fail-fast design philosophy, whereby invalid input should be detected and acted upon as early as possible. First, because failing early helps identify the true origin of the error. And second because there's a category of subtle bugs where invalid input can be combined or processed in ways that make it valid-but-nonsensical, and as a result there are no reported errors at all, just quiet incorrect behavior.

Any language that has Unicode strings can handle ASCII just fine, since ASCII is a strict subset of UTF-8 - that doesn't require the encoding of the strings to be UTF-8. For languages that use a different encoding, it would mean that ASCII gets re-encoded into whatever the language uses, but this is largely an implementation detail.

Of course, if you're reading a source that is not necessarily in any Unicode encoding (UTF-8 or otherwise), and that may be non-string data, and you just need to pass the data through - well then, that's exactly what bytestrings are there for. The fact that you cannot easily mix them with strings (again, even if they're UTF-8-encoded) is a benefit in this case, because such mixing only makes sense if the bytestring is itself properly encoded. If it's not, you just silently get garbage. Using two different types at the input/output boundary makes it clear what assumptions can be made about every particular bit of input.

I understand your position. This is a longstanding debate within the PL community, as you know. I've considered that stance before but I have to say thanks for stating it so well because it's given me pause. I can't say I agree but you're not "wrong". I agree with the fail-fast concept but disagree that's the place to do it. A strict fail-fast diehard probably shouldn't even be using Python, IMO. I don't have anything else useful to add because this is simply a engineering design philosophy difference. We both think our own conclusions are better of course but I appreciate you detailing yours. Good chat, upvoted.
I agree that this is a difference caused by different basic premises, and irreconcilable without surrendering one of those premises.

And yes, you're right that Python is probably not a good example of ideologically pure fast-fail in general, simply because of its dynamic typing - so that part of argument is not really strong in that particular context.

(Side note: don't you find it amusing that between the two languages that we're discussing here, the one that is statically typed - Go - chose to be more relaxed about its strings in the type system, while the one that is dynamically typed - Python - chose to be more strict?)

The key takeaway, I think, is that there is no general consensus on this. As you say, "this is a longstanding debate within the PL community" (and not just the topics we touched upon, but the more broad design of strings - there's also the Ruby take on it, for example, where string is bytes + encoding).

My broader take, encoding aside, is that strings are actually not sufficiently high-level in most modern languages (including Python). I really want to get away from the notion of string as a container of anything, even Unicode code points, and instead treat it as an opaque representation of text that has various views exposing things like code points, glyphs, bytes in various encodings etc. But none of those things should be promoted as the primary view of what a string is - and, consequently, you shouldn't be able to write things like `s[i]` or `for ch in s`.

The reason is that I find that the ability to index either bytes or codepoints, while useful, has this unfortunate trait that it often gives right results on limited inputs (e.g. ASCII only, or UCS2 only, or no combining characters) while being wrong in general. When it's accessible via shorthand "default" syntax that doesn't make it explicit, people use it because that's what they notice first, and without thinking about what their mode of access implies. Then they throw inputs that are common for them (e.g. ASCII for Americans, Latin-1 for Western Europeans), observe that it works correctly, and conclude that it's correct (even though it's not).

If they are, instead, forced to explicitly spell out access mode - like `s.get_code_point(i)` and `for ch in s.code_points()` - they have to at least stop and think what a "code point" even is, how it's different from various other explicit options that they have there (like `s.get_glyph(i)`), and which one is more suitable to the task at hand.

And if we assume that all strings are sequences of bytes in UTF-8, the same point would also apply to those bytes - i.e. I'd still expect `s[i]` to not work, and having to write something like `s.get_byte(i)` or `for b in s.bytes()` - for all the same reasons.

It's worth nothing that I think Python's success and simplicity came from the ease of use and flexibility. It should have Go's strings. You have a point there. The type annotations are really jumping the shark too, it's just no longer the language that once made so much sense.

On the consensus on bytestrings assumed as UTF8, there's only 1 new language without legacy baggage that has skyrocketing popularity and it has bytestrings assumed as UTF8. Everyone I've surveyed says that's no coincidence, including members of the Python core dev team. So that's where I'm seeing consensus on the string issue. While Python3 has struggled and folks are rightly irritated. Because some were irresponsible everyone has to pay, that's not really how I viewed Python's strengths prior to 3.x. Zed said it best but it really was one of the best examples of how dynamic typing can be fun.