| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by captaincrowbar 3251 days ago
	This sort of thing is why Swift treats grapheme clusters, rather than code points or bytes or "characters", as the fundamental unit of text. When I first started learning Swift I thought that was a weird choice that would just get in the way, but these days I'm coming around to their way of thinking.

3 comments

hsivonen 3251 days ago

Treating grapheme clusters as fundamental is slightly problematic in the sense that then the fundamentals change as Unicode adds more combining characters. A reasonable programming environment should still provide iteration by grapheme cluster as a library feature whose exact behavior is expected to change over time as the library tracks new Unicode versions.

Depending on the task at hand, iterating by UTF-8 byte or by code point can make sense, too. And the definition of these is frozen regardless of Unicode version, which makes these safer candidates for "fundamental" operations. There is no right unit of iteration for all tasks.

link

Retra 3251 days ago

Pfft, that is just as bad. There is no 'fundamental unit of text'. There are different units of text that are appropriate to different tasks.

If I want to know how much memory to allocate, bytes are it. If I want to know how much screen space to allocate, font rendering metrics are it. If I want to do word-breaking, grapheme clusters are it.

None of these are fundamental.

link

johncolanduoni 3251 days ago

Since a string doesn't have any font rendering metrics (in that it lacks a font or size), I'm not sure how you expect a language's String implementation to take it into account. Similarly, bytes will change based on encodings, which most people would expect a language's String type to abstract over. Do you really want UTF8 and UTF16 strings to behave differently and introduce even more complexity to a very complex system?

There are languages whose orthographies don't fit the Unicode grapheme cluster specification, but they're complex enough that I doubt there's any way to deal with them properly other than having someone proficient in them looking over your text processing or pawning it off to a library. At least with grapheme clusters your code won't choke on something as simple as unnormalized Latin text.

link

int_19h 3251 days ago

It's not about encodings at all, actually. It's about the API that is presented to the programmer.

And the way you take it all into account is by refusing to accept any defaults. So, for example, a string type should not have a "length" operation at all. It should have "length in code points", "length in graphemes" etc operations. And maybe, if you do decide to expose UTF-8 (which I think is a bad idea) - "length in bytes". But every time someone talks about the length, they should be forced to specify what they want (and hence think about why they actually need it).

Similarly, strings shouldn't support simple indexing - at all. They should support operations like "nth codepoint", "nth grapheme" etc. Again, forcing the programmer to decide every time, and to think about the implications of those decisions.

It wouldn't solve all problems, of course. But I bet it would reduce them significantly, because wrong assumptions about strings are the most common source of problems.

link

Avernar 3251 days ago

What do you mean by "expose UTF-8"? Because nothing about UTF-8 requires that you give byte access to the string.

As for indexing, strings shouldn't require indexing period. That's the ASCII way of thinking, especially fixed width columns and such. You should be thinking relatively. For example, find me the first space then using that point in the string the next character needs to be letter. When you build you're code that way you don't fall for the trap of byte indexing or the performance hit of codepoint indexing (UTF-8) or grapheme indexing (all encodings).

link

ubernostrum 3251 days ago

There are real-world textual data types for which your idealized approach simply does not work. As in, it would be impossible or impossibly unwieldy to validate conformance to the type using your approach, because they require indexing to specific locations, or determining length, or both.

For example, I work for a company that does business in the (US) Medicare space. Every Medicare beneficiary has a HICN -- Health Insurance Claim Number -- and HICNs come in different types which need to be identified. Want to know how to identify them? By looking at prefix and suffix characters in specific positions, and the length of what comes between them. For example, the prefix 'A' followed by six digits means the person identified is the primary beneficiary and was first covered under the Railroad Retirement Board benefit program. Doing this without indexing and length operations is madness.

These data types can and should be subjected first to some basic checks to ensure they're not nonsense (i.e., something expected to be a numeric value probably should not contain Linear B code points, and it's probably a good idea to at least throw a regex at it first, but then applying regex to Unicode also has quirks people don't often expect at first...).

link

Avernar 3250 days ago

I don't see why this would be hard with iterators. You have an iterstor to the start of the HICN, either at the start of a or deep in the string. Take a second iterator and set it to the first. Loop six times advancing that iterator checking to see if it's a digit. Then check if the next position is a space.

For the prefix and suffix and how many characters between them you do the above but use the second iterator to find the suffix. Then you either keep track of how many characters you advanced or ask for how many characters between the two.

It's very easy to think about it this way as that's how a normal (non programmer) human would do it. Basically the code literally does what you wrote in english above.

My point being is that iterators are much faster than indexing when the underlying string system uses graphemes. You can do pretty much anyting just as easy or easier with iterators than with indexing. The big exception is fixed width columnar tet files. I've seen a lot of these in financial situations but fortuanately those systems are ASCII based so not an issue.

link

mjevans 3250 days ago

int_19h's approach is still valid for this; you're asking for whole displayed characters which are combined of some (you don't need to know) number of bits in memory across several units of the memory segment(s) that hold the string.

Based on your description, the correct solution is probably to use a structure or class of a more regular format to store the decoded HICN in pre-broken form. If they really only allow numbers in runs of text you might save space and speed comparison/indexing by doing this.

link

int_19h 3249 days ago

By "expose UTF-8" I mean exposing the underlying UTF-8 representation of the string directly on the object itelf, instead of going through a separate byte array (or byte array view, to avoid copying)

link

Avernar 3248 days ago

Ah, I see. I agree that it would be a bad idea to give acess to the UTF-8 representstion.

As for length in bytes, a good way to handle most use cases regarding that is to have a function that truncates the string to fit into a certain number of bytes. That way you can make sure it fits into whatever fixed buffer and the truncation would happen on a grapheme level.

link

saagarjha 3250 days ago

This is exactly how Swift’s Strings work.

link

coldtea 3250 days ago

>If I want to know how much memory to allocate, bytes are it. If I want to know how much screen space to allocate, font rendering metrics are it. If I want to do word-breaking, grapheme clusters are it.

Size in memory/bytes you could get trivially for any string (and this doesn't change with whether you chose bytes, graphemes or code points or whatever to iterate).

Screen space is irrelevant/orthogonal to encoding -- it appears at the font level and the font rendering engine that will give the metrics will accept whatever encoding it is.

link

Retra 3250 days ago

>Screen space is irrelevant/orthogonal to encoding

Exactly. That's why measurements of string length shouldn't ever assume I'm looking for a unit-of-offset for a monospaced font.

The problem is that most naive programmers think that's what a string length is and should be.

link

ubernostrum 3251 days ago

I would love it if Python at least would support the '\X' regex metacharacter in its own built-in regex module. Right now you have to turn to a third-party implementation to get that.

I also wish Python would expose more of the Unicode database than it does; I've had to turn to third-party modules that basically build their own separate database for some Unicode stuff Python doesn't provide (like access to the Script property).

link