Hacker News new | ask | show | jobs
by int_19h 3402 days ago
I agree that this is a difference caused by different basic premises, and irreconcilable without surrendering one of those premises.

And yes, you're right that Python is probably not a good example of ideologically pure fast-fail in general, simply because of its dynamic typing - so that part of argument is not really strong in that particular context.

(Side note: don't you find it amusing that between the two languages that we're discussing here, the one that is statically typed - Go - chose to be more relaxed about its strings in the type system, while the one that is dynamically typed - Python - chose to be more strict?)

The key takeaway, I think, is that there is no general consensus on this. As you say, "this is a longstanding debate within the PL community" (and not just the topics we touched upon, but the more broad design of strings - there's also the Ruby take on it, for example, where string is bytes + encoding).

My broader take, encoding aside, is that strings are actually not sufficiently high-level in most modern languages (including Python). I really want to get away from the notion of string as a container of anything, even Unicode code points, and instead treat it as an opaque representation of text that has various views exposing things like code points, glyphs, bytes in various encodings etc. But none of those things should be promoted as the primary view of what a string is - and, consequently, you shouldn't be able to write things like `s[i]` or `for ch in s`.

The reason is that I find that the ability to index either bytes or codepoints, while useful, has this unfortunate trait that it often gives right results on limited inputs (e.g. ASCII only, or UCS2 only, or no combining characters) while being wrong in general. When it's accessible via shorthand "default" syntax that doesn't make it explicit, people use it because that's what they notice first, and without thinking about what their mode of access implies. Then they throw inputs that are common for them (e.g. ASCII for Americans, Latin-1 for Western Europeans), observe that it works correctly, and conclude that it's correct (even though it's not).

If they are, instead, forced to explicitly spell out access mode - like `s.get_code_point(i)` and `for ch in s.code_points()` - they have to at least stop and think what a "code point" even is, how it's different from various other explicit options that they have there (like `s.get_glyph(i)`), and which one is more suitable to the task at hand.

And if we assume that all strings are sequences of bytes in UTF-8, the same point would also apply to those bytes - i.e. I'd still expect `s[i]` to not work, and having to write something like `s.get_byte(i)` or `for b in s.bytes()` - for all the same reasons.

1 comments

It's worth nothing that I think Python's success and simplicity came from the ease of use and flexibility. It should have Go's strings. You have a point there. The type annotations are really jumping the shark too, it's just no longer the language that once made so much sense.

On the consensus on bytestrings assumed as UTF8, there's only 1 new language without legacy baggage that has skyrocketing popularity and it has bytestrings assumed as UTF8. Everyone I've surveyed says that's no coincidence, including members of the Python core dev team. So that's where I'm seeing consensus on the string issue. While Python3 has struggled and folks are rightly irritated. Because some were irresponsible everyone has to pay, that's not really how I viewed Python's strengths prior to 3.x. Zed said it best but it really was one of the best examples of how dynamic typing can be fun.