
by acdha 3507 days ago

> That there's no such thing as Unicode issues or needing to use silly "u" prefixes. That 0.1 + 0.2 == 0.3 reliably that you can write accounting software worry-free and not wonder about why it fails in so many old langs.

It's interesting that you lead with a bunch of examples which are very old or simply wrong.

For example, in any language where you don't think about Unicode you are guaranteed to have encoding issues as soon as you have sufficiently diverse (i.e. real-world) data. If Python 2's u"" prefix offends you so – I must say, hearing a Perl programmer complain about punctuation is a somewhat novel experience – note that Python 3, released in 2008, made strings Unicode by default. In every language, however, you will need to think about file encodings everywhere you read or write data until we finally hit that halcyon decade (century?) where you can assume UTF-8 with a very high level of confidence.

Similarly, Python has had decimal-accurate math since the early 2000s so the developer is free to pick whether they value absolute precision over standard IEEE floating point semantics. Perl 6 dynamically switches numeric classes so the simple syntax you show will lose precision at some point depending on the data and order of operations – that's why the documentation specifies Rat as “limited precision” and it means that anyone writing accounting (or, in many cases, scientific) software would explicitly use an arbitrary-precision data type to avoid the classic floating-point math problems:

    $ perl6 --version
    This is Rakudo version 2016.10 built on MoarVM version 2016.10
    $ perl6
    > 123456789 - 1e-5
    123456788.99999
    > 123456789 - 1e-6
    123456788.999999
    > 123456789 - 1e-7
    123456789
    > (123456789 - 1e-1) - 123456789
    -0.0999999940395355
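For comparison, here is a minimal Python sketch of the decimal-accurate option mentioned above, using the standard library `decimal` module. The variable names are only illustrative; the point is that the developer opts in to exact base-10 arithmetic rather than getting it from plain literals.

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 have no exact base-2 representation.
float_equal = (0.1 + 0.2) == 0.3

# Decimals built from strings keep the base-10 digits exactly.
decimal_equal = (Decimal("0.1") + Decimal("0.2")) == Decimal("0.3")

# The last subtraction from the session above, done in decimal arithmetic.
decimal_diff = (Decimal("123456789") - Decimal("0.1")) - Decimal("123456789")

print(float_equal)    # False
print(decimal_equal)  # True
print(decimal_diff)   # -0.1
```

Building the values from strings matters here: `Decimal(0.1)` would faithfully capture the binary approximation of the float rather than the decimal value 0.1.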

Please note that I'm not saying Perl 6 is terrible. This is a well-known trade-off which everyone has to learn about if they work in fields where this matters. Perl 6 has arbitrary-precision types built in, so anyone dealing with financial data is simply going to learn to specify that giving up precision is not an acceptable trade-off for performance:

    > my $a = Num(12356789)
    12356789
    > $a + FatRat(0.1) - $a
    0.099999999627471
    > my $a = FatRat(12356789)
    12356789
    > $a + FatRat(0.1) - $a
    0.1

I would suggest focusing on the parts of Perl 6 which you like rather than hurling bricks at other languages. Things like start / hyper sound kind of interesting and it'd be far more interesting to hear about how those work in practice and how you manage issues like communications overhead or shared data than your personal dislikes about some other language.

2 comments

raiph 3507 days ago

> Python 3 was released in 2008 which changed to use Unicode by default.

Sure, but Python 3 only took on correct handling at the whole string level. It ignored correct handling at the character and sub-string level. Something similar applies for many programming languages. I think this was Zoffix's point.[1]

> explicitly use an arbitrary-precision data type to avoid the classic floating-point math problems:

    > 123456789 - 1e-5
    123456788.99999

Literals of the form `1e-5` are not arbitrary precision in Perl 6. They are floating point (called Num in Perl 6). Hence your above result. Similarly:

    > my $a = Num(12356789)
    12356789
    > $a + FatRat(0.1) - $a
    0.099999999627471

would work if instead you wrote:

    > my $a = 12356789
    12356789
    > $a + 0.1 - $a
    0.1
    > $a + 0.00001 - $a
    0.00001

If all inputs are 100% accurate, arbitrary precision Ints and/or FatRats then all results will be too. But the same applies even if some Rats are also involved provided the final result requires a denominator less than 18,446,744,073,709,551,615.

[1] Of the 150+ languages with Rosettacode solutions for returning the length of a string (at http://rosettacode.org/wiki/String_length) just 3 (Elixir, Perl 6, Swift) have a built-in way to get the right result for what Unicode defines as "what a user thinks of as a character".


acdha 3507 days ago

> Sure, but Python 3 only took on correct handling at the whole string level. It ignored correct handling at the character and sub-string level. Something similar applies for many programming languages. I think this was Zoffix's point

It very well could have been what Zoffix had in mind but do note that this goes back to the problem with the rant style of post.

As for the specific question of character handling which you raised, given how many people have made other choices on whether a length function should count codepoints or graphemes, I'm reluctant to say that using one versus the other is a question of “correct” or “incorrect” rather than simply “different”. I suspect most programmers are rarely going to care, and the ones who do are going to need to learn enough more about Unicode and i18n that this difference will not be a deciding factor for anyone.

(This is not to detract from the great history the Perl community has with taking seriously the benefits of having a rich Unicode API – I've routinely used this as an example to follow – but simply that Unicode is a deceptively simple-looking topic)

> Literals of the form `1e-5` are not arbitrary precision in Perl 6. They are floating point (called Num in Perl 6).

Yes. Again, my point wasn't that Perl 6 is bad but rather that it's lazy and ineffective advocacy to say something like “you can write accounting software worry-free” when you know full well that anyone doing that for real will still need to understand the differences and that most users will never care because they aren't writing financial software.


b2gills 3507 days ago

There definitely are good reasons for a length function to count both codepoints and graphemes. Which is why Perl 6 has a method for both. This is also why neither of them is called `length`. In fact if you ever attempt to call the `length` method on an object Rakudo will ask you "Did you mean 'elems', 'chars' or 'codes'"
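To make the "more than one length" point concrete, here is a small Python sketch. It is not Perl 6's implementation; `lengths` is a hypothetical helper, and it only covers the counts Python's standard library can produce directly (code points and encoded sizes). As far as I know the standard library has no grapheme-cluster counter, so that count is left out rather than guessed at.

```python
def lengths(text):
    """Return three common notions of 'length' for a string."""
    return {
        "code_points": len(text),                           # what len() counts in Python 3
        "utf8_bytes": len(text.encode("utf-8")),            # size when stored as UTF-8
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
    }

devanagari = "\u0915\u094d\u0937\u093f"  # a Devanagari conjunct syllable, four code points
emoji = "\U0001F600"                     # one code point outside the Basic Multilingual Plane

print(lengths(devanagari))  # {'code_points': 4, 'utf8_bytes': 12, 'utf16_units': 4}
print(lengths(emoji))       # {'code_points': 1, 'utf8_bytes': 4, 'utf16_units': 2}
```

None of these numbers is what a reader would call the number of characters, which is the argument for giving each count its own name.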


acdha 3506 days ago

Yes - it's a good distinction to make and when you need it, it's extremely useful. As I said, Perl deserves respect for Unicode in general and raising awareness of this issue is a key reason for that.

However, keep in mind that I was responding to a comment which simply referred to “Unicode issues or needing to use silly "u" prefixes”. This whole tangent started with a guess about what the author might have had in mind.

On this specific point, note that I wasn't saying that it wasn't good to have both but that I wouldn't call a language incorrect for working only with Unicode codepoints. That's because for most programmers this entire class of problem is someone else's problem – usually whoever wrote the text rendering engine in your browser or OS – and the people who do need to care have needed to learn most of e.g. http://unicode.org/reports/tr29/ anyway and understand which portions are relevant to whatever task and data they're working with. It's kind of cool that e.g. 'क्षि'.elems == 1, 'क्षि'.chars == 2, etc. but on the rare occasions where that would be more than trivia, I was more interested in questions like measured width in a certain font or language-specific collation or word-breaking rules.

This is all coming back to why I don't think attacking other languages is effective advocacy unless you're very knowledgeable on the details and impact for working programmers. Telling someone that a commonly used tool which works well for millions of users is incorrect is unlikely to produce the desired outcome. Showing them a cool thing which your favorite tool does better is usually going to be more effective because it gives you something concrete to talk about and it's not confrontational. Programming languages are a major commitment and very few people are going to switch because of one bullet point – that either takes market requirements (e.g. Objective C/Swift, JavaScript) or gradually building up a good reputation over time.


raiph 3506 days ago

> I wouldn't call a language incorrect for working only with Unicode codepoints.

I didn't call Python incorrect. I'll footnote another attempt to communicate the point I was trying to make.^1

> for most programmers this entire class of problem is someone else's problem

Do you primarily mean most Western devs when you say "most programmers" or are you including Chinese, Indian, etc.?

Do you primarily mean rendering issues when you say "this entire class of problem" or do you mean the full range of character handling issues that ordinary devs occasionally encounter such as comparing strings?

> and the people who do need to care have needed to learn most of e.g. http://unicode.org/reports/tr29/

To date, and for the near term future, sure.

But do you think it was the long term intent of Unicode that devs who merely want to grab the first three characters from a Unicode string have to first get up to speed on these incredibly complex portions of the Unicode standard if they wish to get it right?

TC made great points in his SO answer -- but so did the guy asking the question.

Now users have begun inserting colorized emojis and other such complexities in what might reasonably be considered contemporary run-of-the-mill text strings (eg tweets). I think this problem is going to accelerate.

Those who designed Elixir, Perl 6 and Swift have taken on Unicode, including TR29, as a core language level responsibility so that devs who merely use these programming languages don't immediately get overwhelmed when they just want to compare two strings.
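As a rough illustration of what goes wrong at the code point level, here is a Python sketch using only the standard library `unicodedata` module. The sample strings are my own, not from the thread; they show a comparison that fails until both sides are normalized, and a "first N characters" slice that drops an accent.

```python
import unicodedata

composed = "caf\u00e9"     # "café" with a precomposed é (4 code points)
decomposed = "cafe\u0301"  # "café" as e + COMBINING ACUTE ACCENT (5 code points)

# Both render identically, but a code point comparison says they differ.
raw_equal = composed == decomposed

# Normalizing both sides to the same form (NFC here) makes them compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Slicing by code points can cut a user-perceived character in half:
# the first four code points of the decomposed form lose the accent.
first_four = decomposed[:4]

print(raw_equal)   # False
print(nfc_equal)   # True
print(first_four)  # cafe
```

Normalization handles this particular case, but it does not turn code points into grapheme clusters in general; sequences with no precomposed form still need a segmentation algorithm such as the one in TR29.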

> 'क्षि'.elems == 1

In Perl 6 `.elems` is always `1` for any single string of any length.

> 'क्षि'.chars == 2

By default, character boundaries are determined by the default EGC algorithm specified by Unicode. The default EGC algorithm gives the incorrect result for क्षि.

Getting the correct result (`== 1`) would require `use`ing a module that implements the appropriate tailored grapheme clustering.

> I was more interested in questions like measured width in a certain font or language-specific collation or word-breaking rules.

For now, the Perl 6 perspective on such matters is that devs should use the appropriate Perl 5 modules:

    use Some::Perl5::Module:from<Perl5>;

    ... Perl 6 code ...

Thanks for this exchange. I'm curious to see if you still feel I'm ranting. For now I'm off to an end of world party. Maybe we'll wake up to find President Evan McWho is in charge...

----

^1 A search of the Python 3 docs for "grapheme" (the term used in the Unicode standard to denote what I mean by "atomic character" and what a user thinks of as a character) yields zero matches. Do the Python docs use some other term to denote what a user thinks of as a character? Microsoft uses the term "text element". Swift and Perl 6 use the term "character". What term does Python use?

A search in the Python 3 docs for "character" yields several pages that total over 500 matches. I looked at a few. All corresponded to use of the word "character" to denote a codepoint (an accent, a colorizing instruction, a bidi directive, a base letter, etc.). None corresponded to what a user thinks of as a character. Do you think any uses of the term "character" will turn out to correspond to what a user thinks of as a character / text element / grapheme / whatever you and/or Python docs wish to call them?


espadrine 3507 days ago

For the sake of completeness, let's point out that numeric literals in Perl6 separate rationals (Rat) and floating-point numbers (Num), which means the first Perl6 example would work as intended provided we stick with Rat:

  > 123456789 - 0.0000001
  123456788.9999999
  > (123456789 - 0.1) - 123456789
  -0.1

Rationals represent numbers of the form a÷b, with a a bigint and b a 64-bit integer. When b gets too big for that fast representation, it gets converted to a Num (ie, IEEE 754).

  > (1 / (10 ** 100)).WHAT
  (Num)
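The fallback described above can be emulated in a few lines of Python with `fractions.Fraction`. This is only a sketch of the idea, not how Rakudo implements Rat: `rat_like` and `DENOMINATOR_LIMIT` are hypothetical names, and the 2**64 bound is an assumption taken from the "64-bit integer" description.

```python
from fractions import Fraction

# Assumed bound: denominators must fit in an unsigned 64-bit integer.
DENOMINATOR_LIMIT = 2**64

def rat_like(value):
    """Keep an exact Fraction while the denominator fits; otherwise degrade to float."""
    if value.denominator < DENOMINATOR_LIMIT:
        return value
    return float(value)

# Stays exact: the denominator is only 10.
exact = rat_like((Fraction(123456789) - Fraction(1, 10)) - Fraction(123456789))

# Denominator of 10**100 is far past the limit, so precision is traded for a float.
tiny = rat_like(Fraction(1, 10**100))

print(exact)                # -1/10
print(type(tiny).__name__)  # float
```

Dropping the check and always returning the `Fraction` corresponds to the FatRat behaviour: exact for any rational, at the cost of unbounded denominators.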

That tradeoff is reasonable, as numbers that cannot be represented this way are very likely to be non-rational math (eg, sqrt, exp, sin, pi, that kind of thing). For money, this is safe. For real numbers, well:

  > sin(0).WHAT
  (Num)

FatRat uses a bigint for b. Obviously, it still cannot accurately represent non-rational numbers such as pi.

Reference: https://docs.perl6.org/language/syntax#Number_literals https://docs.perl6.org/type/Rat


acdha 3507 days ago

Thanks! I definitely hope my point didn't come across as “this is terrible” rather than “people for whom this is critical still need to read the docs / write tests”.
