by deathanatos 2395 days ago
I swear there should be some rule or law about how Unicode articles will inevitably muddle code units/points / grapheme clusters / bytes together.

> String length is typically determined by counting codepoints.

> This means that surrogate pairs would count as two characters.

If you were counting code points, a surrogate pair would be 1. If it's two, you're counting code units.

> Combining multiple diacritics may be stacked over the same character. a + ̈ == ̈a, increasing length, while only producing a single character.

Not if you're counting code points or code units, which would both produce an answer of "2", and that's a great example of why you shouldn't count with either.
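To make the distinction concrete, here is a minimal Python sketch. The helper names `code_points` and `utf16_code_units` are made up for illustration; Python's `str` is a sequence of code points, and UTF-16 code units are counted by encoding and dividing the byte length by two.

```python
def code_points(s: str) -> int:
    # Python's str is indexed by code point, so len() counts code points.
    return len(s)


def utf16_code_units(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" emits no BOM.
    return len(s.encode("utf-16-le")) // 2


emoji = "\U0001F600"    # one code point outside the BMP (a surrogate pair in UTF-16)
combining = "a\u0308"   # "a" followed by COMBINING DIAERESIS

# code_points(emoji) == 1, utf16_code_units(emoji) == 2
# code_points(combining) == 2, utf16_code_units(combining) == 2
```

So the surrogate pair only counts as two if you count code units, and the combining sequence counts as two either way.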

The dark blue on black in tables is next to invisible. And then to put that on white on the alternate rows is just eyeball murder.

> Since there are over 1.1 million UTF-8 glphys (sic)

UTF-8 glyphs twitch; aside from that, I'm really curious how they got that number. In some ways, a font has it easy; my understanding is that modern font formats can do one glyph for acute accent, one glyph for all the vowels/letters, and then compose the glyphs into arrangements for having them combined. (IDK if those are also "glyphs" to the font or not.) But it's less drawing, at least. OTOH, some characters have >1 appearance/"image", AIUI.


Also a law that the author is thinking of one and only one programming language.

> String length is typically determined by counting codepoints.

That depends entirely on what "strings" you are talking about.

In C/Go/Rust/Ruby, char*/string/std::string::String/String is bytes.

In Java/JavaScript, java.lang.String/String is UTF-16 code units.

In Python 3, str is code points.

In Swift, String is extended grapheme clusters.

In Haskell, there are various different "string" types in common use.

And in C++, std::basic_string is a generic container for whatever element type you want. (std::string specialization being for bytes.)
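As a rough illustration of that survey, the sketch below computes the different "lengths" of one string from Python. The function name `length_by_convention` is hypothetical, and the grapheme count is only a crude approximation (it just skips combining marks); real extended grapheme cluster segmentation follows UAX #29 and handles far more cases (ZWJ sequences, Hangul jamo, regional indicators, and so on).

```python
import unicodedata


def length_by_convention(s: str) -> dict:
    """Return the 'length' of s under several common conventions."""
    return {
        # Byte-oriented strings, assuming the bytes are UTF-8.
        "utf8_bytes": len(s.encode("utf-8")),
        # UTF-16 code units, as in Java/JavaScript.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        # Code points, as in Python 3's str.
        "code_points": len(s),
        # ASSUMPTION: crude stand-in for grapheme clusters that only
        # folds combining marks into the preceding character.
        "approx_graphemes": sum(1 for ch in s if unicodedata.combining(ch) == 0),
    }


sample = "ne\u0301\U0001F600"  # "n", "e" + COMBINING ACUTE ACCENT, one emoji
# length_by_convention(sample) ->
# {"utf8_bytes": 8, "utf16_code_units": 5, "code_points": 4, "approx_graphemes": 3}
```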

EDIT: Clarified that I don't disagree with parent comment; merely pointing out additional less-than-precise language.

Sure, different languages have various, usually bad, definitions of length.

The point is that those two sentences themselves in the article are conflicting with each other, not that we're talking about any language in particular. (But certainly the article could go into a survey of common languages like you have.)

I think Rust strings are all Unicode native, though they can be transmuted to bytes.
It's a bit of both. Rust strings are (guaranteed valid) UTF-8 bytes.

s.len() returns the number of bytes; s.chars().count() returns the number of chars (Unicode scalar values), which is not necessarily the number of user-perceived characters.

I'm in love with Swift's approach, where the default representation is a well defined thing that both users and developers think of as "characters", but all the other representations are trivially accessible.
I disagree. Grapheme clusters are locale-dependent, much like string collation is locale-dependent. What Unicode gives you by default, the (extended) grapheme cluster, is about as useful as the DUCET (Default Unicode Collation Element Table): you can live with them, but you will be left unsatisfied. In fact, there are tons of Unicode bugs that can't be corrected for compatibility reasons and can only be fixed via tailored, locale-dependent schemes.

I would like to avoid locales in the language core. It would be great to have locale support in the standard library, but without locale information you can't treat strings as (human) text.

Can you give examples of locale-dependent things, or issues with extended grapheme clusters?
Hangul normalization and collation are broken in Unicode, albeit for slightly different reasons. The Unicode Collation Algorithm explicitly devotes two sections to Hangul; the first, on "trailing weights" [1], is recommended for a detailed explanation.

The Unicode Text Segmentation standard [2] explicitly mentions that Indic aksaras [3] require tailoring of grapheme clusters. Depending on your view, orthographic digraphs also count as examples (Dutch "ij" is sometimes considered a single character).

[1] https://www.unicode.org/reports/tr10/#Trailing_Weights

[2] https://unicode.org/reports/tr29/

[3] https://en.wikipedia.org/wiki/Aksara#Grammatical_tradition

For example, the text "ch" (U+0063 U+0068) is two grapheme clusters in English contexts, but one grapheme cluster in Czech contexts, collated between "h" and "i". [1]

According to Unicode, the text "Chemie" is written exactly the same whether it's the German or the Czech word. However, a German will say it has six letters and a Czech will say it has five.
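A toy Python sketch of that "Chemie" difference follows. Everything here is an assumption for illustration: the `TAILORED_DIGRAPHS` table and `count_letters` function are made up, and real tailoring data would come from something like CLDR rather than a hand-written dict.

```python
# HYPOTHETICAL tailoring table: locale code -> digraphs treated as one letter.
TAILORED_DIGRAPHS = {
    "cs": ("ch",),  # Czech treats "ch" as a single letter
    "de": (),       # German has no such tailoring in this toy model
}


def count_letters(word: str, locale: str) -> int:
    """Count letters, treating the locale's digraphs as single letters."""
    digraphs = TAILORED_DIGRAPHS.get(locale, ())
    lowered = word.lower()
    count = 0
    i = 0
    while i < len(lowered):
        # Consume a whole digraph if one starts here, else one character.
        match = next((d for d in digraphs if lowered.startswith(d, i)), None)
        i += len(match) if match else 1
        count += 1
    return count


# count_letters("Chemie", "de") == 6
# count_letters("Chemie", "cs") == 5
```

The same code points give two different answers, and the only thing that changed is the locale passed in from outside the text.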

Unicode provided a unified way to express international characters within the same text, but the context (i.e. locale) external to the text is still required to sensibly collate and manipulate it according to human sensibilities.

The default definition of grapheme clusters is simply a compromise for a global, locale-less understanding of collation/manipulation of Unicode characters.

> The Unicode definitions of grapheme clusters are defaults: not meant to exclude the use of more sophisticated definitions of tailored grapheme clusters where appropriate. Such definitions may more precisely match the user expectations within individual languages for given processes. For example, “ch” may be considered a grapheme cluster in Slovak, for processes such as collation. The default definitions are, however, designed to provide a much more accurate match to overall user expectations for what the user perceives of as characters than is provided by individual Unicode code points.

> Note: The default Unicode grapheme clusters were previously referred to as "locale-independent graphemes." The term cluster is used to emphasize that the term grapheme is used differently in linguistics. For simplicity and to align terminology with Unicode Technical Standard #10, “Unicode Collation Algorithm” [UTS10], the terms default and tailored are preferred over locale-independent and locale-dependent, respectively.

[1] https://en.wikipedia.org/wiki/Ch_(digraph)

[2] http://www.unicode.org/reports/tr29/

Agreed. Normal "text" operations tend to work quite well.

And the other forms are accessible, e.g. if you write a text-based parser (XML, JSON, etc.), you'll probably want String.unicodeScalars.

> If you were counting code points, a surrogate pair would be 1. If it's two, you're counting code units.

And to be explicit as to why that is: surrogate pairs are a feature of the UTF-16 encoding, where two 16-bit code units ("code units" being the lexemes of the decoder) decode to a single Unicode codepoint.
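For reference, the decoding arithmetic looks roughly like this in Python. The function name `decode_surrogate_pair` is made up for this sketch; the ranges and the formula are the standard UTF-16 ones.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high (leading) surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low (trailing) surrogate")
    # Each surrogate carries 10 bits; the result is offset by 0x10000
    # because the BMP is already covered by single code units.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F600 is encoded in UTF-16 as the code units 0xD83D 0xDE00:
# decode_surrogate_pair(0xD83D, 0xDE00) == 0x1F600
```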

I feel like everything to do with Unicode is clearer if you never bring up how it's encoded; or, alternately, if you pretend for the sake of your tutorial that everybody uses UTF-32, so you can just talk about flinging single-code-unit codepoints around as machine-words, the same way ASCII flings single-code-unit codepoints around as bytes. This being basically what Unicode text-handling libraries are doing underneath anyway.

After all, from the perspective of the Unicode standard itself, all the stuff below the abstraction of "a codepoint" is implementation detail.

The standard has to let the abstraction leak in a few places, like surrogate pairs or BOMs, but these leaks aren't what the Unicode standard is supposed to be "about", and should really be thought of as features of the encodings that have found their way up a layer, rather than features of Unicode per se. Heck, even the categorization of codepoint-ranges into "planes" is just a pragma of UTF-16. Putting these pragma-features front-and-center in a discussion of "what Unicode is", is IMHO entirely backwards.

>I swear there should be some rule or law...

Now is your chance! Distill this comment down into something pithy and deathanatos' law could be a thing.

Naming things is the hardest problem ;)
Thanks for catching that. It's a fairly complex subject matter, and it's particularly hard to get extra eyeballs willing to check for typos.

- String length is typically measured in code units.
- Funny enough, with Unicode normalization, multiple diacritics can be reduced into a single code point.
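A small Python sketch of the normalization point, using the standard `unicodedata` module. The helper name `nfc_length` is made up here. Note that composition only happens when Unicode defines a precomposed character for the sequence, so this does not work for arbitrary stacks of diacritics.

```python
import unicodedata


def nfc_length(s: str) -> int:
    """Number of code points after NFC (canonical composition)."""
    return len(unicodedata.normalize("NFC", s))


one_mark = "a\u0308"         # a + COMBINING DIAERESIS
two_marks = "a\u0308\u0304"  # a + COMBINING DIAERESIS + COMBINING MACRON

# NFC composes both sequences into single precomposed code points:
# unicodedata.normalize("NFC", one_mark) == "\u00e4"   (U+00E4)
# unicodedata.normalize("NFC", two_marks) == "\u01df"  (U+01DF)
# NFD goes the other way: len(unicodedata.normalize("NFD", "\u00e4")) == 2
```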