by b2gills 3508 days ago
There definitely are good reasons for a length function to count both codepoints and graphemes, which is why Perl 6 has a method for each. This is also why neither of them is called `length`. In fact, if you ever attempt to call the `length` method on an object, Rakudo will ask you "Did you mean 'elems', 'chars' or 'codes'?"

Yes - it's a good distinction to make and when you need it, it's extremely useful. As I said, Perl deserves respect for Unicode in general and raising awareness of this issue is a key reason for that.

However, keep in mind that I was responding to a comment which simply referred to “Unicode issues or needing to use silly "u" prefixes”. This whole tangent started with a guess about what the author might have had in mind.

On this specific point, I wasn't saying that having both isn't good; my point was that I wouldn't call a language incorrect for working only with Unicode codepoints. That's because for most programmers this entire class of problem is someone else's problem – usually whoever wrote the text rendering engine in your browser or OS – and the people who do need to care have needed to learn most of e.g. http://unicode.org/reports/tr29/ anyway and understand which portions are relevant to whatever task and data they're working with. It's kind of cool that e.g. 'क्षि'.elems == 1, 'क्षि'.chars == 2, etc. but on the rare occasions where that would be more than trivia, I was more interested in questions like measured width in a certain font or language-specific collation or word-breaking rules.

This all comes back to why I don't think attacking other languages is effective advocacy unless you're very knowledgeable about the details and the impact on working programmers. Telling someone that a commonly used tool which works well for millions of users is incorrect is unlikely to produce the desired outcome. Showing them a cool thing which your favorite tool does better is usually going to be more effective because it gives you something concrete to talk about and it's not confrontational. Programming languages are a major commitment and very few people are going to switch because of one bullet point – that either takes market requirements (e.g. Objective-C/Swift, JavaScript) or gradually building up a good reputation over time.

> I wouldn't call a language incorrect for working only with Unicode codepoints.

I didn't call Python incorrect. I'll footnote another attempt to communicate the point I was trying to make.^1

> for most programmers this entire class of problem is someone else's problem

Do you primarily mean most Western devs when you say "most programmers", or are you including Chinese, Indian, etc.?

Do you primarily mean rendering issues when you say "this entire class of problem", or do you mean the full range of character handling issues that ordinary devs occasionally encounter, such as comparing strings?

> and the people who do need to care have needed to learn most of e.g. http://unicode.org/reports/tr29/

To date, and for the near term future, sure.

But do you think it was the long term intent of Unicode that devs who merely want to grab the first three characters from a Unicode string have to first get up to speed on these incredibly complex portions of the Unicode standard if they wish to get it right?
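To make the "first three characters" case concrete, here is a minimal Python 3 sketch using only the stdlib. The sample string and variable names are my own illustration. Note that normalizing to NFC only rescues this particular case because each e + U+0301 pair has a precomposed form; it is not a general substitute for grapheme segmentation.

```python
import unicodedata

# "été" spelled with combining acute accents (U+0301), i.e. decomposed form.
s = "e\u0301te\u0301"

# Python slices by codepoint, so "first three" means three codepoints here,
# which is only the first two user-perceived characters.
first_three = s[:3]

# Composing to NFC helps in this case because a precomposed é (U+00E9) exists.
# Many sequences (emoji with modifiers, Indic conjuncts) have no such form.
composed = unicodedata.normalize("NFC", s)

print(len(s), len(first_three), len(composed))
print(first_three == "e\u0301t")
print(composed[:3] == "\u00e9t\u00e9")
```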

TC made great points in his SO -- but so did the guy asking the question.

Now users have begun inserting colorized emojis and other such complexities in what might reasonably be considered contemporary run-of-the-mill text strings (e.g. tweets). I think this problem is going to accelerate.

Those who designed Elixir, Perl 6 and Swift have taken on Unicode, including TR29, as a core language level responsibility so that devs who merely use these programming languages don't immediately get overwhelmed when they just want to compare two strings.
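As a rough illustration of the string comparison problem in a codepoint-based language, here is a small Python 3 sketch. The helper `same_text` is hypothetical (my name, not a stdlib function), and NFC normalization only covers canonical equivalence, not everything TR29 deals with.

```python
import unicodedata

composed = "caf\u00e9"      # é as a single codepoint (U+00E9)
decomposed = "cafe\u0301"   # e followed by a combining acute accent (U+0301)


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare after normalizing both sides to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Plain == compares codepoint by codepoint, so these two spellings differ.
print(composed == decomposed)

# After normalization the two spellings compare equal.
print(same_text(composed, decomposed))
```

Both strings render as "café", yet the plain comparison is False and the normalized one is True.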

> 'क्षि'.elems == 1

In Perl 6 `.elems` is always `1` for any single string of any length.

> 'क्षि'.chars == 2

By default, character boundaries are determined by the default EGC algorithm specified by Unicode. The default EGC algorithm gives the incorrect result for क्षि.

Getting the correct result (`== 1`) would require `use`ing a module that implements the appropriate tailored grapheme clustering.
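For comparison, here is what a codepoint-based view of the same string looks like: a minimal Python 3 sketch using only the stdlib `unicodedata` module. I'm assuming the quoted string consists of the four codepoints U+0915 U+094D U+0937 U+093F; the variable names are mine.

```python
import unicodedata

# Assumption: the quoted string is exactly these four codepoints.
s = "\u0915\u094d\u0937\u093f"

# len() counts codepoints, not what a reader perceives as one character.
print(len(s))

# List each codepoint with its Unicode name.
for ch in s:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Here `len(s)` is 4: two letters, a virama and a vowel sign.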

> I was more interested in questions like measured width in a certain font or language-specific collation or word-breaking rules.

For now, the Perl 6 perspective on such matters is that devs should use the appropriate Perl 5 modules:

    use Some::Perl5::Module:from<Perl5>;

    ... Perl 6 code ...

Thanks for this exchange. I'm curious to see if you still feel I'm ranting. For now I'm off to an end of world party. Maybe we'll wake up to find President Evan McWho is in charge...

----

^1 A search of the Python 3 docs for "grapheme" (the term used in the Unicode standard to denote what I mean by "atomic character" and what a user thinks of as a character) yields zero matches. Do the Python docs use some other term to denote what a user thinks of as a character? Microsoft uses the term "text element". Swift and Perl 6 use the term "character". What term does Python use?

A search in the Python 3 docs for "character" yields several pages that total over 500 matches. I looked at a few. All corresponded to use of the word "character" to denote a codepoint (an accent, a colorizing instruction, a bidi directive, a base letter, etc.). None corresponded to what a user thinks of as a character. Do you think any uses of the term "character" will turn out to correspond to what a user thinks of as a character / text element / grapheme / whatever you and/or Python docs wish to call them?