by erik_seaberg 811 days ago
It's weird that any parser that loses digits is tolerated. A parser that forces strings into uppercase US-ASCII never would be.

It's tolerated because the JSON spec explicitly allows it:

   This specification allows implementations to set limits on the range
   and precision of numbers accepted.  Since software that implements
   IEEE 754 binary64 (double precision) numbers [IEEE754] is generally
   available and widely used, good interoperability can be achieved by
   implementations that expect no more precision or range than these
   provide, in the sense that implementations will approximate JSON
   numbers within the expected precision.  A JSON number such as 1E400
   or 3.141592653589793238462643383279 may indicate potential
   interoperability problems, since it suggests that the software that
   created it expects receiving software to have greater capabilities
   for numeric magnitude and precision than is widely available.

   Note that when such software is used, numbers that are integers and
   are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
   sense that implementations will agree exactly on their numeric
   values.
And yes, this is completely insane for a format that is supposed to be specifically for serialization and interop. Needless to say, the industry has enthusiastically adopted it to the point where it became the standard.
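
To make the quoted range concrete, here is a minimal Python sketch of what happens just outside [-(2**53)+1, (2**53)-1]. It assumes a receiving parser that stores every number as a binary64 double, which is simulated here with `parse_int=float`; Python's own `json` module keeps integers exact by default.

```python
import json

# 2**53 - 1 is the largest integer in the range the spec calls interoperable.
safe_max = 2**53 - 1
beyond = 2**53 + 1

# Python's json module keeps integers exact, so this round-trips here...
exact = json.loads(str(beyond))

# ...but a parser that stores every number as a binary64 double
# (simulated with parse_int=float) lands on a neighbouring value.
as_double = json.loads(str(beyond), parse_int=float)

print(exact == beyond)               # True
print(int(as_double) == beyond)      # False
print(float(safe_max) == safe_max)   # True
```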

I miss XML these days. Sure, it was verbose and had a bunch of different and probably excessive numeric types defined for XML Schema... but at least they were well-defined (https://www.w3.org/TR/xmlschema-2/#built-in-datatypes). And, on the other hand, without a schema, all you had were strings. Either way, no mismatched expectations.

That's true for every floating point number in every programming language you have ever used, though.

    $ python3
    Python 3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 100000.000000000017
    100000.00000000001
This is why Decimal exists:

  Python 3.8.10 (default, Nov 22 2023, 10:22:35) 
  [GCC 9.4.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> from decimal import Decimal
  >>> Decimal('100000.000000000017')
  Decimal('100000.000000000017')
For example:

  >>> import json
  >>> json.loads('{"a": 100000.000000000017}')
  {'a': 100000.00000000001}
  >>> json.loads('{"a": 100000.000000000017}', parse_float=Decimal)
  {'a': Decimal('100000.000000000017')}
And not every programming language offers a Decimal type. In most of those that do, there’s usually a performance penalty associated with it, not to mention issues of interoperability and whether developers know it exists. Financial calculations usually use integers with an implicit decimal offset (e.g., US currency amounts expressed in cents rather than dollars), while in other contexts the inherent inaccuracy of IEEE floating types is often deemed a non-issue. The biggest potential problem lies in treating values that look and act kind of like numbers as numbers, e.g., Dewey Decimal classification numbers or the topic in a Library of Congress classification.¹

1. This is a bit on my mind lately as I discovered that LibraryThing’s sort by LoC classification seems to be broken so I exported my library (discovering that they export as ISO8859-1 with no option for UTF-8) and wrote a custom sorter for LOC classification codes for use in finally arranging the books on my shelves after my move last year.
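
The integer-with-implied-offset approach for currency mentioned above might look something like this in Python. Both helpers are hypothetical, assume non-negative amounts, and simply truncate anything past two decimal places.

```python
# Amounts are held as integer cents; the decimal point is only implied.
def parse_dollars(text: str) -> int:
    """Convert a string such as '19.99' to integer cents (hypothetical helper)."""
    dollars, _, cents = text.partition(".")
    return int(dollars) * 100 + int(cents.ljust(2, "0")[:2])

def format_cents(amount: int) -> str:
    """Render integer cents as dollars (hypothetical helper)."""
    return f"{amount // 100}.{amount % 100:02d}"

prices = ["0.10", "0.20"]
total = sum(parse_dollars(p) for p in prices)

print(format_cents(total))   # 0.30
print(0.10 + 0.20 == 0.30)   # False with binary floats
```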

Decimal is not arbitrary precision, though. It has many of the same issues, you'll just see them in different places.

  >>> Decimal('100000.00000000000000000000017') + Decimal('1')
  Decimal('100001.0000000000000000000002')
But serializing/deserializing Decimal using the json module is futile.
Why is it futile? It can be serialized/deserialized perfectly through its string representation.
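
A minimal sketch of that string round trip, assuming both sides agree that the field "a" carries a decimal encoded as a string:

```python
import json
from decimal import Decimal

value = Decimal("100000.000000000017")

# Serialize: json.dumps cannot encode Decimal natively, so pass a fallback
# that turns it into its string representation.
encoded = json.dumps({"a": value}, default=str)

# Deserialize: the receiver must know that field "a" holds a decimal string.
decoded = Decimal(json.loads(encoded)["a"])

print(encoded)            # {"a": "100000.000000000017"}
print(decoded == value)   # True
```
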
> That's true for every floating point number in every programming language you have ever used, though.

Alright, if "you" have only ever used Python. In C, for example, we have hexadecimal floating-point literals that represent all floats and doubles exactly (including infinities and NaNs that make the JSON parser fail miserably).

If you use the same syntax as OP, C’s parser will also round that literal. The existence of a hex literal for floats is orthogonal.
> we have hexadecimal floating point literals that represent all floats and doubles exactly

How do you do that?

Here are a couple of resources I found, though I’m not sure they are about exactly what you mean:

https://stackoverflow.com/questions/65480947/is-ieee-754-rep...

https://gcc.gnu.org/onlinedocs/gcc/Hex-Floats.html

Furthermore, what exactly do you mean by “all floats and doubles exactly”?

Yes, I was talking about what is described in your resources. You can do this:

    // define a floating-point literal in hex and print it in decimal
    float x = 0x1p-8;          // x = 1.0/256
    printf("x = %g\n", x);     // prints 0.00390625
    
    // define a floating point literal in decimal and print it in various ways
    float y = 0.3;             // non-representable, rounded to closest float
    printf("y = %g\n", y);     // 0.3 (%g prints 6 significant digits by default)
    printf("y = %.10f\n", y);  // 0.3000000119
    printf("y = %.20f\n", y);  // 0.30000001192092895508
    printf("y = %a\n", y);     // 0x1.333334p-2
So for example if you make a variable that has the value parent commenter used

100000.000000000017

And then you print it.

Does it preserve the exact value?

Your question is ambiguous for two different reasons. First, this value is not representable as a floating-point number, so there's no way that you can even store it in a float. Second, once you have a float variable, you can print it in many different ways. So, the answer to your question is, irremediably, "it depends what you mean by exact value".

If you print your variable with the %a format, then YES, the exact value is preserved and there is no loss of information. The problem is that the literal that you wrote cannot be represented exactly. But this is hardly a fault of the floats. Ints have exactly the same problem:

    int x = 2.5;   // x gets the value 2
    int y = 7/3;   // same thing
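
For readers following along in Python rather than C, `float.hex()` and `float.fromhex()` are the rough counterparts of `%a` and hex literals. A small sketch of the same point: the rounding happens when the decimal literal is read, and nothing is lost after that.

```python
x = 100000.000000000017      # rounded to the nearest double on input

# float.hex() plays the role of C's %a: it writes the stored bits exactly.
h = x.hex()
back = float.fromhex(h)

print(h)
print(back == x)                       # True: nothing is lost after storage
print(float.fromhex("0x1p-8"))         # 0.00390625, i.e. 1/256
```
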
https://0.30000000000000004.com/

Although it would be good to move in the direction of using a BigDecimal equivalent by default when ingesting unknown data.