by masklinn 3947 days ago
> Simple enough, in essence given first argument, print it up to length 12. As an added bonus this also deals with unicode correctly

That's not true, Python 3 uses codepoint-based indexing but it will break if combining characters are involved. For instance:

    > python3 test.py देवनागरीदेवनागरी
    देवनागरीदेवन
Because there is no precomposed version of these multi-codepoint grapheme clusters, some of the 10 user-visible characters take more than a single codepoint, so you end up with 8 user-visible characters rather than the expected 10.
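A minimal sketch of what goes wrong, using only the standard library. The string is spelled out with escapes so that copy/paste cannot alter the codepoints; the variable names are just for illustration.

```python
import unicodedata

# "devanagari" written in the Devanagari script: 8 codepoints, 5 visible characters.
word = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"
s = word * 2

# len() and slicing both count codepoints, not user-visible characters.
codepoint_count = len(s)
truncated = s[:12]

# The slice ends on a bare consonant; the vowel sign that belonged to it is cut off.
last_name = unicodedata.name(truncated[-1])
dropped_name = unicodedata.name(s[12])

print(codepoint_count)
print(last_name)
print(dropped_name)
```

The last kept codepoint is a consonant and the first dropped one is its vowel sign, which is exactly the split described above.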

edit: the original version used the input string "ǎěǐǒǔa̐e̐i̐o̐u̐ȃȇȋȏȗ" where clusters turn out to have precomposed versions after all. Replaced it by devanāgarī repeated once (in the devanāgarī script)


The easy Python way:

    import sys
    import regex
    # \X matches one extended grapheme cluster; the raw string keeps the backslash intact.
    print(regex.match(r"\X{,12}", sys.argv[1]).group())
with the regex[1] package that should be in the stdlib Any Day Now™.

[1]: https://pypi.python.org/pypi/regex

Interesting, I had no idea the `re` module was getting revamped. Scheduled for 3.5 or later?
Certainly not 3.5, although a few years ago I would have told you almost the exact opposite.

I wouldn't hold your breath. The issue tracker[1] suggests 3.7 or 3.8 as optimistic. Guido made some comment somewhere relatively recently, but I can't find where. It's entirely possible it will never actually happen; time doesn't seem to have made people more enthusiastic.

It's a shame, because the new module is awesome.

[1] http://bugs.python.org/msg230846

Yup. A long time ago, while working on a project with some particularly gnarly Unicode issues, I got in the habit of thinking in terms of grapheme clusters instead of code points (or "characters", for whatever definition of "character" one wishes to use), and it has served me very well. Combining characters pop up in the most interesting places, often where and when you least expect them! ٩(•̃̾●̮̮̃̾•̃̾)۶

Ruby's unicode_utils gem has a nice implementation of the standard grapheme cluster segmentation algorithm, and Python's wrapper around ICU works quite well. Go's concept of runes is certainly an improvement, but it doesn't handle combining characters out of the box...

> Combining characters pop up in the most interesting places, often where and when you least expect them! ٩(•̃̾●̮̮̃̾•̃̾)۶

The good news is Unicode 8 will make them way more frequent! (alternate emoji skin colors are specified via combining characters) much as Unicode 6 made astral characters way more "in your face" (by standardising emoji in the SMP)

That's a shame, it works as you'd expect in perl6:

  sub MAIN($s) { say $s.substr(0,12) }

  $ perl6 test.p6 ǎěǐǒǔa̐e̐i̐o̐u̐ȃȇȋȏȗ
  ǎěǐǒǔa̐e̐i̐o̐u̐ȃȇ
Turns out there are precomposed versions of these clusters, so your system might just be using these.

Could you retry with the input "देवनागरीदेवनागरी"?

I'm not quite sure how to interpret the output as it doesn't render particularly kindly in my terminal:

  sub MAIN($s) {
  	say "{$s.chars}: $s";
  	my $b =  $s.substr(0,12);
  	say "{$b.chars}: $b";
  }

  $ perl6 hn-test2.p6 देवनागरीदेवनागरी
  16: देवनागरीदेवनागरी
  12: देवनागरीदेवन
So apparently perl6 is also "wrong" and operates on codepoints. Your system composed my original string, so each (base, diacritic) pair was pasted as a single precomposed character (I expect that if you try out the Python version on your system you'll also get the "right" answer).

The new string is composed of 10 user-visible characters (5 characters repeated twice) but 16 codepoints (and this time I carefully checked that there was no precomposed version):

    DEVANAGARI LETTER DA
    DEVANAGARI VOWEL SIGN E
    DEVANAGARI LETTER VA
    DEVANAGARI LETTER NA
    DEVANAGARI VOWEL SIGN AA
    DEVANAGARI LETTER GA
    DEVANAGARI LETTER RA
    DEVANAGARI VOWEL SIGN II
    DEVANAGARI LETTER DA
    DEVANAGARI VOWEL SIGN E
    DEVANAGARI LETTER VA
    DEVANAGARI LETTER NA
    DEVANAGARI VOWEL SIGN AA
    DEVANAGARI LETTER GA
    DEVANAGARI LETTER RA
    DEVANAGARI VOWEL SIGN II
Operating on codepoints, both versions cut after the second DEVANAGARI LETTER NA (न) breaking that grapheme cluster (it should be ना) and not displaying the final two clusters ग and री.
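For anyone who wants to check the cluster counts without installing anything, here is a rough sketch. It is an approximation and not the Unicode grapheme cluster algorithm (UAX #29): it simply attaches every codepoint whose general category starts with "M" (a combining mark) to the codepoint before it, which happens to be enough for this particular string. `approx_clusters` is a made-up helper name.

```python
import unicodedata

def approx_clusters(text):
    """Group each base codepoint with the combining marks that follow it.

    Approximation only: real segmentation (UAX #29) has more rules than this.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

word = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"
s = word * 2

all_clusters = approx_clusters(s)
cut_clusters = approx_clusters(s[:12])

# Truncating by cluster instead of by codepoint keeps every cluster whole.
safe_cut = "".join(all_clusters[:12])

print(len(all_clusters), len(cut_clusters))
```

With fewer than 12 clusters in the input, the cluster-based cut returns the whole string, while the codepoint-based cut leaves a lone consonant as its last "cluster".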
> So apparently perl6 is also "wrong" and operates on codepoints

Yes and no. Yes, because the in-development Rakudo compiler is clearly currently giving the wrong result, and no because it operates on grapheme clusters (but has bugs).

(You can work with codepoints if you really want to but the normal string/character functions that use the normal string type, Str, work -- or more accurately are supposed to work -- on the assumption that "character" == grapheme cluster; afaik it's supposed to match the Unicode default Extended Grapheme Cluster specification.)

Fwiw I've filed a bug: https://rt.perl.org/Ticket/Display.html?id=125927

Yeah you're right, a caveat in the docs says that current implementations aren't finished with this. I was under the impression the NFG work was done but I'll catch up with people on irc.
> I expect that if you try out the Python version on your system you'll also get the "right" answer.

I don't think so. In my tests standard python (2.7 and 3.5) ignores grapheme clusters.

Python does ignore grapheme clusters. That point was about my original test case, which used grapheme clusters that I later found out had precomposed equivalents, so a transfer chain performing NFC would leave the test case with no combining characters (or multi-codepoint grapheme clusters) left in it.
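A small sketch of the NFC effect, assuming the first cluster of the original test string arrived as "a" followed by COMBINING CARON (U+030C). Only the standard library's `unicodedata.normalize` is used.

```python
import unicodedata

# Base letter plus combining caron: two codepoints, one visible character.
decomposed = "a\u030c"
composed = unicodedata.normalize("NFC", decomposed)

# The Devanagari word has no precomposed forms, so NFC leaves it alone.
word = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"
nfc_word = unicodedata.normalize("NFC", word)

print(len(decomposed), len(composed))
print(len(word), len(nfc_word))
```

After NFC the Latin cluster is a single codepoint, so codepoint slicing looks correct on it; the Devanagari word keeps all 8 codepoints and still exposes the problem.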
Languages that cannot deal with graphemes are lame. I daresay this solution below should score 20 in OP's imaginary scale.

    $ perl -CADS -E'say $ARGV[0] =~ /(\X{5})/' देवनागरीदेवनागरी
    देवनागरी
Length of input string is: 10 graphemes, 16 codepoints, 48 octets (UTF-8).

Length of output string is: 5 graphemes, 8 codepoints, 24 octets (UTF-8).
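The codepoint and octet figures are easy to double-check from Python (the grapheme counts need a real segmenter, so they are left out of this sketch). The variable names are illustrative only.

```python
word = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"
input_string = word * 2   # what was passed on the command line
output_string = word      # what \X{5} matched

# len() counts codepoints; encoding to UTF-8 gives the octet count.
input_codepoints = len(input_string)
input_octets = len(input_string.encode("utf-8"))
output_codepoints = len(output_string)
output_octets = len(output_string.encode("utf-8"))

print(input_codepoints, input_octets)
print(output_codepoints, output_octets)
```

Every codepoint in this string sits in the Devanagari block and takes three octets in UTF-8, which is why the octet count is three times the codepoint count.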