
> I wouldn't call a language incorrect for working only with Unicode codepoints.

I didn't call Python incorrect. I'll footnote another attempt to communicate the point I was trying to make.^1

> for most programmers this entire class of problem is someone else's problem

Do you primarily mean most Western devs when you say "most programmers", or are you including Chinese, Indian, etc.?

Do you primarily mean rendering issues when you say "this entire class of problem", or do you mean the full range of character handling issues that ordinary devs occasionally encounter, such as comparing strings?

> and the people who do need to care have needed to learn most of e.g. http://unicode.org/reports/tr29/

To date, and for the near-term future, sure.

But do you think it was the long term intent of Unicode that devs who merely want to grab the first three characters from a Unicode string have to first get up to speed on these incredibly complex portions of the Unicode standard if they wish to get it right?
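To make "grab the first three characters" concrete, here's a rough Python sketch. The helper `approx_clusters` is my own hypothetical name, and it only glues combining marks onto the preceding codepoint; it is a crude stand-in for grapheme clustering, not an implementation of TR29.

```python
import unicodedata

def approx_clusters(text):
    """Group each codepoint with the combining marks that follow it.

    A rough approximation for illustration only, NOT the TR29 algorithm.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "noël" spelled with a decomposed ë: 'e' followed by U+0308 COMBINING DIAERESIS.
word = "noe\u0308l"

# Slicing by codepoint keeps the 'e' but drops its diaeresis.
by_codepoint = word[:3]

# Slicing by approximate cluster keeps the 'e' and its diaeresis together.
by_cluster = "".join(approx_clusters(word)[:3])
```

The codepoint slice silently turns "noë" into "noe", which is exactly the kind of mistake a dev makes without knowing anything went wrong.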

TC made great points in his SO answer, but so did the guy asking the question.

Now users have begun inserting colorized emojis and other such complexities in what might reasonably be considered contemporary run-of-the-mill text strings (e.g. tweets). I think this problem is going to accelerate.

Those who designed Elixir, Perl 6 and Swift have taken on Unicode, including TR29, as a core language level responsibility so that devs who merely use these programming languages don't immediately get overwhelmed when they just want to compare two strings.
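For comparison, here's a minimal sketch of what a Python dev has to do by hand just to compare two strings that look identical to a user. `unicodedata.normalize` is in the standard library; the helper name `same_text` is hypothetical. This only covers canonical equivalence, not case folding or language-specific collation.

```python
import unicodedata

composed = "caf\u00e9"     # é as a single codepoint (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Plain == compares codepoint sequences, so these differ.
naive_equal = composed == decomposed

# After normalization the two spellings compare equal.
normalized_equal = same_text(composed, decomposed)
```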

> 'क्षि'.elems == 1

In Perl 6, `.elems` is always `1` for any single string of any length, because a string is a single item rather than a list of characters.

> 'क्षि'.chars == 2

By default, character boundaries are determined by the default extended grapheme cluster (EGC) algorithm specified by Unicode. The default EGC algorithm gives the incorrect result for क्षि.

Getting the correct result (`== 1`) would require `use`ing a module that implements the appropriate tailored grapheme clustering.
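Here's a Python sketch of the difference, under loud assumptions. `approx_clusters` is a hypothetical helper that attaches combining marks to the preceding codepoint, which for this particular string happens to reproduce the two-cluster split described above. The `join_conjuncts` flag is a made-up, oversimplified tailoring that also joins whatever follows a virama; a real tailoring would be more careful.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA

def approx_clusters(text, join_conjuncts=False):
    """Rough clustering for illustration only, NOT the TR29 algorithm."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        # Hypothetical tailoring: keep a codepoint that follows a virama
        # in the same cluster, so a conjunct stays in one piece.
        if clusters and (is_mark or (join_conjuncts and after_virama)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# क्षि as codepoints: KA, VIRAMA, SSA, VOWEL SIGN I
kshi = "\u0915\u094d\u0937\u093f"

codepoints = len(kshi)                                        # what Python's len() counts
default_like = approx_clusters(kshi)                          # splits after the virama
tailored_like = approx_clusters(kshi, join_conjuncts=True)    # one cluster
```

So the same string is 4, 2 or 1 "characters" depending on which layer you ask, and only the last matches what a reader of the script would say.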

> I was more interested in questions like measured width in a certain font or language-specific collation or word-breaking rules.

For now, the Perl 6 perspective on such matters is that devs should use the appropriate Perl 5 modules:

    use Some::Perl5::Module:from<Perl5>;

    ... Perl 6 code ...

Thanks for this exchange. I'm curious to see if you still feel I'm ranting. For now I'm off to an end of world party. Maybe we'll wake up to find President Evan McWho is in charge...

----

^1 A search of the Python 3 docs for "grapheme" (the term used in the Unicode standard to denote what I mean by "atomic character" and what a user thinks of as a character) yields zero matches. Do the Python docs use some other term to denote what a user thinks of as a character? Microsoft uses the term "text element". Swift and Perl 6 use the term "character". What term does Python use?

A search in the Python 3 docs for "character" yields several pages that total over 500 matches. I looked at a few. All corresponded to use of the word "character" to denote a codepoint (an accent, a colorizing instruction, a bidi directive, a base letter, etc.). None corresponded to what a user thinks of as a character. Do you think any uses of the term "character" will turn out to correspond to what a user thinks of as a character / text element / grapheme / whatever you and/or Python docs wish to call them?