| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jcranmer 2813 days ago

> 1. it has proper, validated unicode strings (though the stdlib is not grapheme-aware so manipulating these strings is not ideal)

Grapheme clusters are overrated in their importance for processing. The list of times you want to iterate over grapheme clusters:

1. You want to figure out where to position the cursor when you hit left or right.

2. You want to reverse a string. (When was the last time you wanted to do that?)

The list of times when you want to iterate over Unicode codepoints:

1. When you're implementing collation, grapheme cluster searching, case modification, normalization, line breaking, word breaking, or any other Unicode algorithm.

2. When you're trying to break text into separate RFC 2047 encoded-words.

3. When you're trying to display the fonts for a Unicode string.

4. When you're trying to convert between charsets.

Cases where neither is appropriate:

1. When you want to break text to separate lines on the screen.

2. When you want to implement basic hashing/equality checks.

(I'm not sure where "cut the string down to 5 characters because we're out of display room" falls in this list. I suspect the actual answer is "wrong question, think about the problem differently").

Grapheme clusters is relatively expensive to compute, and its utility is very circumscribed. Iterating over Unicode codepoints is much more useful and foundational and yet still very cheap.

3 comments

zbentley 2813 days ago

> Grapheme clusters are overrated in their importance for processing. The list of times you want to iterate over grapheme clusters:

> 1. You want to figure out where to position the cursor when you hit left or right.

> 2. You want to reverse a string. (When was the last time you wanted to do that?)

You missed the big one:

3. You want to determine the logical (and often visual) length of a string.

Sure, there are some languages where logical-length is less meaningful as a concept, but there are many, many languages in which it's a useful concept, and can only be easily derived by iterating grapheme clusters.

link

anonymfus 2813 days ago

Visual length of a string is measured in pixels and millimetres, not characters. In a font/graphics library, not in a text processing one.

link

zbentley 2813 days ago

Sorry, visual length as in visual number of "character-equivalent for purposes of word length" things. Those things are close to, but not exactly the same as, grapheme clusters, so the latter can often be used as an imperfect (but much more useful than unicode points or bytes) proxy for the former.

There's no perfect representation of number-of-character-equivalents that doesn't require understanding of the language being handled (and it's meaningless in some languages as I said), but there are many written languages in which knowing the length in those terms is both extremely useful and extremely hard to do without grapheme cluster identification.

link

Groxx 2812 days ago

>character-equivalent for purposes of word length

Serious question: why would you want to do this?

I know it's fashionable to limit usernames to X characters... but why? The main reason I've seen has been to limit the rendered length so there are some mostly-reliable UI patterns that don't need to worry about overflows or multiple lines. At least until someone names themselves:

ＷＷＷＷＷＷＷＷＷＷＷＷＷＷＷＷＷＷＷＷ

Which is 20 characters, no spaces, and will break loads of things.

(I'm intentionally ignoring "db column size" because that depends on your encoding, so it's unrelated to graphemes)

link

ubernostrum 2812 days ago

Serious question: why would you want to do this?

Have you never, in your entire life, encountered a string data type with a length rule? All sorts of ID values (to take an obvious example) either have fixed length, or a set of fixed lengths such that every valid value is one of those lengths, and many are alphanumeric, meaning you cannot get round length checks by trying to treat them as integers. Validating/understanding these values also often requires identifying what code point, not what grapheme, is at a specific index.

Plus there are things like parsing algorithms for standard formats. To take another example: you know how people sometimes repost the Stack Overflow question asking why "chucknorris" turns into a reddish color when used as a CSS color value? HTML5 provides an algorithm for parsing a (string) color declaration and turning it into a 24-bit RGB color value. That algorithm requires, at times, checking the length in code points of the string, and identifying the values of code points at specific indices. A language which forbids those operations cannot implement the HTML5 color parsing algorithm (through string handling; you'd instead have to do something like turn the string into a sequence of ints corresponding to the code points, and then manually manage everything, and why do that to yourself?).

link

Groxx 2812 days ago

Yes. All instances I've seen have been due to byte-size restrictions (so it depends on encoding) or for visual reasons (based on fundamentally flawed assumptions). With exceptions for dubious science around word-lengths between languages / as a difficulty/intelligence proxy, or just having fun identifying patterns. (interesting, absolutely, but of questionable utility / bias)

But every example you've given have been around visuals, byte sizes, or code points (which are unambiguously useful, yes). Nothing about graphemes.

link

masklinn 2813 days ago

So?

Rust's stdlib provides iteration on code units and code points. The use cases where these are useful are covered.

It does not provide iteration on grapheme clusters, the use cases where this is useful are not covered (and require an external dependency).

At no point am I requesting replacing codepoints-wise iteration by clusters-wise iteration.

link

Manishearth 2813 days ago

I think a more accurate characterization is that neither code points nor grapheme clusters are usually what you want, but when you're naively processing text it's usually better to go with grapheme clusters so you don't mess up _as_ badly :)

There are definitely some operations that make sense on code points: but if you go through your list, (1), (2), (4) is something you'll rarely implement yourself (you just need a library), (3) is ... kinda rare? The most common valid use case for dealing with code points is parsing, where the grammar is defined in ascii or in terms of code points (which is relatively common).

Treating strings as either code points or graphemes basically enshrines the assumptions that segmentation operations make sense on strings at all -- they only do in specific contexts.

Most string operations you can think of come from incorrect assumptions about text. Like you said, the answer to most questions of the form "how do I X a string" is "wrong question" (reversing a string is my favorite example of this).

The only string operation that universally makes sense is concatenation (when dealing with "valid" strings, i.e. strings that actually make sense and don't do silly things like starting with a stray modifier character). Replacement makes some sense but you have to define "replacement" better based on your context. Taking substrings makes sense but typically only if you already have some metric of validity for the substring -- either that the substring was an ingredient of a prior concatenation, or that you have a defined text format like HTML that lets you parse out substrings. (This is why i actually kinda agree with Rust's decision to use bytes rather than code points for indexing strings -- if you're doing it right you should have obtained the offset from an operation on the string anyway, so it doesn't matter how you index it, so pick the fast one)

Most string operations go downhill from here where there's usually a right thing to do for that operation but it's highly context dependent.

Even hashing and equality are context-dependent, sometimes comparing bytes is enough, but other times you want to NFC or something and it gets messy quickly :)

In the midst of all this, grapheme clusters + NFC (what Swift does) are abstractions that let you naively deal with strings and mess up less. Your algorithm will still be wrong, but its incorrectness will cause fewer problems.

But yeah, you're absolutely right that grapheme clusters are pretty niche for when they're the correct tool to reach for. I'd just like to add that they're often the less blatantly incorrect tool to reach for :)

> (I'm not sure where "cut the string down to 5 characters because we're out of display room" falls in this list. I suspect the actual answer is "wrong question, think about the problem differently").

This is true, and not thinking about the problem differently is what caused the iOS Arabic text crash last year.

For many if not most scripts fewer code points is not a guarantee of shorter size -- you can even get this in Latin if you have a font with some wild kerning -- it's just that this is much easier to trigger in Arabic since you have some letters that have tiny medial forms but big final forms.

link

tialaramex 2813 days ago

There's a very sound argument to be made for the opposite conclusion, that if we care about a problem we should make it necessary to solve the problem correctly or else stuff very obviously breaks, not have broken systems seem like they kinda work until they're used in anger.

Outside of MySQL (which unaccountably had a weird MySQL-only character encoding which only covered the BMP and named it "utf8" then when you tried to shove actual UTF-8 strings into it, they'd get silently truncated because YOLO MySQL) UTF-8 implementations tended to handle the other planes much better than UTF-16 implementations, many of which were in practice UCS-2 and then some thin excuses. Why? Because if you didn't handle multiple code units in UTF-8 nothing worked, you couldn't even write some English words like café properly. For years pretending your UCS-2 code was UTF-16 would only be noticed by people using obscure writing systems or academics.

I am also reminded of approaches to i18n for software primarily developed and tested mainly by monolingual English speakers. Obviously these users won't know if a localised variant they're examining is correctly translated, but they can be given a fake "locale" in which translated text is visibly different in some consistent way, e.g. it has been "flipped" upside down by abusing symbols that look kind of like the Latin alphabet upside down, or Pig Latin is used "Openway Ocumentday". The idea here again is that problems are obvious rather than corner cases, if the translations are broken or missing it'll say "Open Document" in the test locale which is "wrong" and you don't need to wait for a specialist German-speaking tester to point that out.

link

Manishearth 2813 days ago

> There's a very sound argument to be made for the opposite conclusion, that if we care about a problem we should make it necessary to solve the problem correctly or else stuff very obviously breaks, not have broken systems seem like they kinda work until they're used in anger.

Oh, definitely :)

I'm rationalizing the focus on grapheme clusters, if I had my way "what is a string" would be a mandatory unit of programming language education and reasoning about this would be more strongly enforced by programming languages.

link