by masklinn 3947 days ago

> Simple enough, in essence given first argument, print it up to length 12. As an added bonus this also deals with unicode correctly

That's not true: Python 3 uses codepoint-based indexing, so it will break when combining characters are involved. For instance:

    > python3 test.py देवनागरीदेवनागरी
    देवनागरीदेवन

Because there is no precomposed version of these multi-codepoint grapheme clusters, some of the 10 user-visible characters take more than a single codepoint, and you end up with 8 user-visible characters rather than the expected 10.

edit: the original version used the input string "ǎěǐǒǔa̐e̐i̐o̐u̐ȃȇȋȏȗ", whose clusters turn out to have precomposed versions after all. Replaced it with devanāgarī written twice (in the devanāgarī script).
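To make the truncation above concrete, here is a minimal sketch using only the standard library. The helper `approx_clusters` is hypothetical and deliberately simplified: it attaches any codepoint whose general category starts with "M" to the preceding codepoint, which happens to be enough for this input but is not the full Unicode grapheme cluster algorithm.

```python
import unicodedata

def approx_clusters(text):
    """Group each base codepoint with the marks that follow it.

    Simplification (an assumption, not full UAX #29): a codepoint whose
    general category starts with "M" joins the preceding cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "devanāgarī" written twice, spelled with escapes so the codepoints are explicit
word = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"
text = word * 2

by_codepoint = text[:12]                          # codepoint-based slice
by_cluster = "".join(approx_clusters(text)[:12])  # cluster-aware truncation

print(len(text), len(approx_clusters(text)))  # 16 10
print(len(approx_clusters(by_codepoint)))     # 8
print(by_cluster == text)                     # True
```

The codepoint slice keeps 12 codepoints but only 8 clusters, the last of which is cut in half, while the cluster-aware version keeps all 10.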

4 comments

Veedrac 3947 days ago

The easy Python way:

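    import sys
    import regex  # third-party package, not the stdlib "re"

    # \X matches one grapheme cluster; the raw string avoids an invalid "\X" escape.
    print(regex.match(r"\X{,12}", sys.argv[1]).group())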

with the regex[1] package that should be in the stdlib Any Day Now™.

[1]: https://pypi.python.org/pypi/regex

Spiritus 3947 days ago

Interesting, I had no idea the `re` module was getting revamped. Scheduled for 3.5 or later?

Veedrac 3947 days ago

Certainly not 3.5, although a few years ago I would have told you almost the exact opposite.

I wouldn't hold your breath. The issue tracker[1] suggests 3.7 or 3.8 as optimistic. Guido made some comment somewhere relatively recently, but I can't find where. It's entirely possible it will never actually happen; time doesn't seem to have made people more enthusiastic.

It's a shame, because the new module is awesome.

[1] http://bugs.python.org/msg230846

stevenbedrick 3947 days ago

Yup. A long time ago, while working on a project with some particularly gnarly Unicode issues, I got in the habit of thinking in terms of grapheme clusters instead of code points (or "characters", for whatever definition of "character" one wishes to use), and it has served me very well. Combining characters pop up in the most interesting places, often where and when you least expect them! ٩(•̃̾●̮̮̃̾•̃̾)۶

Ruby's unicode_utils gem has a nice implementation of the standard grapheme cluster segmentation algorithm, and Python's wrapper around ICU works quite well. Go's concept of runes is certainly an improvement, but it doesn't handle combining characters out of the box...

masklinn 3946 days ago

> Combining characters pop up in the most interesting places, often where and when you least expect them! ٩(•̃̾●̮̮̃̾•̃̾)۶

The good news is that Unicode 8 will make them way more frequent (alternate emoji skin colors are specified via combining characters), much as Unicode 6 made astral characters way more "in your face" (by standardising emoji in the SMP)!
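As a rough illustration of both points, the sketch below builds one skin-toned emoji. The choice of THUMBS UP SIGN followed by a Fitzpatrick modifier is an assumption made for the example; any base emoji that accepts a modifier would do.

```python
import unicodedata

# One user-visible emoji: a base emoji plus a skin tone modifier (illustrative choice)
thumbs_up = "\U0001F44D\U0001F3FD"

for ch in thumbs_up:
    # Both codepoints lie outside the BMP, i.e. they are "astral"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} astral={ord(ch) > 0xFFFF}")

codepoints = len(thumbs_up)
utf16_units = len(thumbs_up.encode("utf-16-le")) // 2

print(codepoints)   # 2
print(utf16_units)  # 4
```

A codepoint-based `thumbs_up[:1]` would silently drop the skin tone, and slicing by UTF-16 code unit could split a surrogate pair.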

hahainternet 3947 days ago

That's a shame, it works as you'd expect in perl6:

  sub MAIN($s) { say $s.substr(0,12) }

  $ perl6 test.p6 ǎěǐǒǔa̐e̐i̐o̐u̐ȃȇȋȏȗ
  ǎěǐǒǔa̐e̐i̐o̐u̐ȃȇ

masklinn 3947 days ago

Turns out there are precomposed versions of these clusters, so your system might just be using these.

Could you retry with the input "देवनागरीदेवनागरी"?

hahainternet 3947 days ago

I'm not quite sure how to interpret the output as it doesn't render particularly kindly in my terminal:

  sub MAIN($s) {
  	say "{$s.chars}: $s";
  	my $b =  $s.substr(0,12);
  	say "{$b.chars}: $b";
  }

  $ perl6 hn-test2.p6 देवनागरीदेवनागरी
  16: देवनागरीदेवनागरी
  12: देवनागरीदेवन

masklinn 3947 days ago

So apparently perl6 is also "wrong" and operates on codepoints. Your system composed my original string, so each (base, diacritic) pair was pasted as a single precomposed character (I expect that if you try out the Python version on your system you'll also get the "right" answer).

The new string is composed of 10 user-visible characters (5 characters repeated twice) but 16 codepoints (and this time I carefully checked that there was no precomposed version):

    DEVANAGARI LETTER DA
    DEVANAGARI VOWEL SIGN E
    DEVANAGARI LETTER VA
    DEVANAGARI LETTER NA
    DEVANAGARI VOWEL SIGN AA
    DEVANAGARI LETTER GA
    DEVANAGARI LETTER RA
    DEVANAGARI VOWEL SIGN II
    DEVANAGARI LETTER DA
    DEVANAGARI VOWEL SIGN E
    DEVANAGARI LETTER VA
    DEVANAGARI LETTER NA
    DEVANAGARI VOWEL SIGN AA
    DEVANAGARI LETTER GA
    DEVANAGARI LETTER RA
    DEVANAGARI VOWEL SIGN II

Operating on codepoints, both versions cut after the second DEVANAGARI LETTER NA (न) breaking that grapheme cluster (it should be ना) and not displaying the final two clusters ग and री.
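The listing and the cut point can be reproduced with the standard `unicodedata` module. This is only a sketch: the string is spelled with escapes to rule out copy-and-paste normalization, and the variable names are illustrative.

```python
import unicodedata

# "devanāgarī" written twice, as explicit codepoints
word = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"
text = word * 2

names = [unicodedata.name(ch) for ch in text]
for index, name in enumerate(names, start=1):
    # A 12-codepoint slice keeps entries 1..12 and drops the rest
    marker = "  <- last codepoint kept by [:12]" if index == 12 else ""
    print(f"{index:2d} {name}{marker}")

kept, dropped = text[:12], text[12:]
print(unicodedata.name(kept[-1]))     # DEVANAGARI LETTER NA
print(unicodedata.name(dropped[0]))   # DEVANAGARI VOWEL SIGN AA
```

The first dropped codepoint is the vowel sign that belongs to the last kept letter, which is exactly the broken cluster described above.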

raiph 3947 days ago

> So apparently perl6 is also "wrong" and operates on codepoints

Yes and no. Yes, because the in-development Rakudo compiler is clearly currently giving the wrong result, and no because it operates on grapheme clusters (but has bugs).

(You can work with codepoints if you really want to but the normal string/character functions that use the normal string type, Str, work -- or more accurately are supposed to work -- on the assumption that "character" == grapheme cluster; afaik it's supposed to match the Unicode default Extended Grapheme Cluster specification.)

Fwiw I've filed a bug: https://rt.perl.org/Ticket/Display.html?id=125927

hahainternet 3947 days ago

Yeah you're right, a caveat in the docs says that current implementations aren't finished with this. I was under the impression the NFG work was done but I'll catch up with people on irc.

raiph 3947 days ago

> I expect that if you try out the Python version on your system you'll also get the "right" answer.

I don't think so. In my tests standard python (2.7 and 3.5) ignores grapheme clusters.

masklinn 3946 days ago

Python ignores grapheme clusters. That point was about my original test case, which used grapheme clusters that I later found out had precomposed equivalents, so a transfer chain performing NFC would leave the test case with no combining characters (or multi-codepoint grapheme clusters) in it.
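The effect of NFC on the two kinds of test case can be sketched with `unicodedata.normalize`. The single pair "a" plus COMBINING CARON is picked here as a representative of the original Latin test string; it is an illustrative choice, not the full string.

```python
import unicodedata

# Latin base letter + combining caron: a precomposed form exists
decomposed = "a\u030c"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))   # 2 1
print(composed == "\u01ce")             # True

# DEVANAGARI LETTER DA + VOWEL SIGN E: nothing to compose into
devanagari = "\u0926\u0947"
normalized = unicodedata.normalize("NFC", devanagari)
print(len(normalized))                  # 2
```

After NFC the Latin pair is a single codepoint, so codepoint slicing looks correct by accident, while the Devanagari cluster still spans two codepoints.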

bmn_ 3947 days ago

Languages that cannot deal with graphemes are lame. I daresay the solution below should score 20 on OP's imaginary scale.

    $ perl -CADS -E'say $ARGV[0] =~ /(\X{5})/' देवनागरीदेवनागरी
    देवनागरी

Length of input string is: 10 graphemes, 16 codepoints, 48 octets (UTF-8).

Length of output string is: 5 graphemes, 8 codepoints, 24 octets (UTF-8).
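The codepoint and octet counts above can be checked in Python with a short sketch. The grapheme counts are not computed here, since the standard library has no grapheme segmentation; the first 8 codepoints are taken as the 5-grapheme output shown above.

```python
# "devanāgarī" written twice, as explicit codepoints
word = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"
text = word * 2

input_codepoints = len(text)
input_octets = len(text.encode("utf-8"))   # each Devanagari codepoint takes 3 bytes

output = text[:8]                          # the first word, i.e. the 5 graphemes printed above
output_codepoints = len(output)
output_octets = len(output.encode("utf-8"))

print(input_codepoints, input_octets)      # 16 48
print(output_codepoints, output_octets)    # 8 24
```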
