by paulddraper 2390 days ago
Also a law that the author is thinking of one and only one programming language.

> String length is typically determined by counting codepoints.

That depends entirely on what "strings" you are talking about.

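In C/Go/Rust/Ruby, char*/string/std::string::String/String is bytes. (Ruby's String#length counts characters in the string's encoding, though; String#bytesize counts bytes.)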

In Java/JavaScript, java.lang.String/String is UTF-16 code units.

In Python 3, str is code points.

In Swift, String is extended grapheme clusters.

In Haskell, there are various different "string" types in common use.

And in C++, std::basic_string is a generic container for whatever character type you want. (std::string is the specialization for char, i.e. bytes.)
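As a rough illustration of the units listed above, here is a minimal Python sketch that measures one string three ways. The function name and the sample string are made up for this example. Grapheme clusters are left out because counting them needs a text segmentation library.

```python
def length_report(text):
    """Return the length of `text` in three different units."""
    return {
        # What C, Go and Rust report for UTF-8 encoded text.
        "utf8_bytes": len(text.encode("utf-8")),
        # What Java and JavaScript report (2 bytes per UTF-16 code unit).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # What Python 3 reports natively.
        "code_points": len(text),
    }


# "e" + COMBINING ACUTE ACCENT (U+0301), then an emoji outside the BMP (U+1F642).
sample = "e\u0301\U0001F642"
print(length_report(sample))
# {'utf8_bytes': 7, 'utf16_code_units': 4, 'code_points': 3}
```

A reader would see two "characters" here, which is none of the three numbers.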

EDIT: Clarified that I don't disagree with parent comment; merely pointing out additional less-than-precise language.


Sure, different languages have various, usually bad, definitions of length.

The point is that those two sentences in the article conflict with each other, not that we're talking about any language in particular. (But certainly the article could go into a survey of common languages like you have.)

I think Rust strings are all Unicode native, though they can be transmuted to bytes.

It's a bit of both. Rust strings are (guaranteed valid) UTF-8 bytes.

str.len() returns the number of bytes; s.chars().count() returns the number of Unicode scalar values (Rust's char), which is not always the number of user-perceived characters.
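The same two ideas (validity is checked once at the boundary, and byte length differs from the scalar-value count) can be sketched in Python. This is only an analogy, not Rust's API: `decode_valid_utf8` is a hypothetical helper written for this example.

```python
def decode_valid_utf8(raw):
    """Return the decoded text, or None if `raw` is not valid UTF-8.

    Hypothetical helper: it mimics a constructor that refuses invalid
    input, so every string it hands back is known to be well formed.
    """
    try:
        return raw.decode("utf-8")  # strict decoding rejects malformed bytes
    except UnicodeDecodeError:
        return None


raw = b"caf\xc3\xa9"            # "café" encoded as UTF-8
text = decode_valid_utf8(raw)

print(len(raw))                 # 5 bytes
print(len(text))                # 4 code points
print(decode_valid_utf8(b"\xff\xfe"))  # None: not valid UTF-8
```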

I'm in love with Swift's approach, where the default representation is a well defined thing that both users and developers think of as "characters", but all the other representations are trivially accessible.

I disagree. Grapheme clusters are locale-dependent, much like string collation is locale-dependent. What Unicode gives you by default, the (extended) grapheme cluster, is about as useful as the DUCET (Default Unicode Collation Element Table): you can live with it, but you will not be satisfied. In fact there are tons of Unicode bugs that can't be corrected for compatibility reasons, and can only be fixed via tailored, locale-dependent schemes.

I would like to avoid locales in the language core. It would be great to have locale support in the standard library, but without locale information you can't treat strings as (human) text.

Can you give examples of locale-dependent things, or issues with extended grapheme clusters?

Hangul normalization and collation are broken in Unicode, albeit for slightly different reasons. The Unicode Collation Algorithm explicitly devotes two sections to Hangul; the first, on "trailing weights" [1], is the one I recommend for a detailed explanation.

The Unicode Text Segmentation standard [2] explicitly mentions that Indic aksaras [3] require tailoring of grapheme clusters. Depending on your view, orthographic digraphs can also count as examples (Dutch "ij" is sometimes considered a single character, for instance).

[1] https://www.unicode.org/reports/tr10/#Trailing_Weights

[2] https://unicode.org/reports/tr29/

[3] https://en.wikipedia.org/wiki/Aksara#Grammatical_tradition

For example, the text "ch" (U+0063 U+0068) is two grapheme clusters in English contexts, but one grapheme cluster in Czech contexts, collated between "h" and "i". [1]

According to Unicode, the text "Chemie" is written exactly the same whether it's the German or the Czech word. However, a German will say it has six letters and a Czech will say it has five.
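The "Chemie" example can be sketched as a toy letter counter that takes the locale as an explicit input. Everything here is an assumption made for illustration: the `TAILORED_DIGRAPHS` table, the locale codes, and the function name are hypothetical, and the code counts code points rather than real default grapheme clusters.

```python
# Hypothetical tailoring table: digraphs that count as one letter per locale.
TAILORED_DIGRAPHS = {
    "de": (),        # German: "c" and "h" are separate letters
    "cs": ("ch",),   # Czech: "ch" is a single letter
}


def count_letters(word, locale):
    """Count letters in `word`, merging the locale's digraphs into one."""
    digraphs = TAILORED_DIGRAPHS.get(locale, ())
    count = 0
    i = 0
    while i < len(word):
        for digraph in digraphs:
            # Compare a slice of the original word, ignoring case.
            if word[i:i + len(digraph)].lower() == digraph:
                i += len(digraph)
                break
        else:
            i += 1  # no digraph matched: consume one code point
        count += 1
    return count


print(count_letters("Chemie", "de"))  # 6
print(count_letters("Chemie", "cs"))  # 5
```

The input text is byte-for-byte identical in both calls; only the locale argument changes the answer.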

Unicode provided a unified way to express international characters within the same text, but the context (i.e. locale) external to the text is still required to sensibly collate and manipulate it according to human sensibilities.
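Collation shows the same dependence on context. Below is a minimal sketch of a Czech-style sort key, assuming a deliberately simplified alphabet (plain a to z with "ch" placed after "h", and no accented letters). The names `CZECH_ORDER` and `czech_sort_key` are hypothetical; real code would use a proper collation library.

```python
# Simplified, hypothetical alphabet: a..h, then "ch", then i..z.
CZECH_ORDER = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: position for position, letter in enumerate(CZECH_ORDER)}


def czech_sort_key(word):
    """Turn `word` into a list of letter ranks, treating "ch" as one letter."""
    lowered = word.lower()
    key = []
    i = 0
    while i < len(lowered):
        if lowered.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            # Letters outside the toy alphabet sort after everything else.
            key.append(RANK.get(lowered[i], len(RANK) + ord(lowered[i])))
            i += 1
    return key


words = ["chata", "cihla", "hrad", "igelit"]
print(sorted(words))                      # ['chata', 'cihla', 'hrad', 'igelit']
print(sorted(words, key=czech_sort_key))  # ['cihla', 'hrad', 'chata', 'igelit']
```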

The default definition of grapheme clusters is simply a compromise for a global, locale-less understanding of collation/manipulation of Unicode characters.

> The Unicode definitions of grapheme clusters are defaults: not meant to exclude the use of more sophisticated definitions of tailored grapheme clusters where appropriate. Such definitions may more precisely match the user expectations within individual languages for given processes. For example, “ch” may be considered a grapheme cluster in Slovak, for processes such as collation. The default definitions are, however, designed to provide a much more accurate match to overall user expectations for what the user perceives of as characters than is provided by individual Unicode code points.

> Note: The default Unicode grapheme clusters were previously referred to as "locale-independent graphemes." The term cluster is used to emphasize that the term grapheme is used differently in linguistics. For simplicity and to align terminology with Unicode Technical Standard #10, “Unicode Collation Algorithm” [UTS10], the terms default and tailored are preferred over locale-independent and locale-dependent, respectively.

[1] https://en.wikipedia.org/wiki/Ch_(digraph)

[2] http://www.unicode.org/reports/tr29/

Agreed. Normal "text" operations tend to work quite well.

And the other forms are accessible, e.g. if you write a text-based parser (XML, JSON, etc.), you'll probably want String.unicodeScalars.
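To see why a parser wants the scalar-level view, consider a closing quote followed by a combining mark: as a grapheme cluster the two fuse into one "character", so a cluster-by-cluster scan would never see a bare quote. The Python sketch below scans code points instead (Python's str is indexed by code point). The function `scan_string_literal` is hypothetical, handles only backslash escapes, and returns the raw content without unescaping it.

```python
def scan_string_literal(text, start=0):
    """Scan a double-quoted literal beginning at `start`.

    Returns (raw_content, index_after_closing_quote).
    """
    if text[start:start + 1] != '"':
        raise ValueError("expected opening quote")
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2          # skip the backslash and the escaped code point
            continue
        if ch == '"':
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated string literal")


# The closing quote is followed by COMBINING ACUTE ACCENT (U+0301).
source = '"ab"\u0301 rest'
print(scan_string_literal(source))  # ('ab', 4)
```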