by camgunz 480 days ago
Sure, I think CBOR's "suggested" tags (or whatever they are) are probably useful to most people. The tradeoff is that they create pressure for implementations to support them, and that's not free. For example, bignum libraries are pretty heavyweight; they're not really the kind of thing you'd want to include in a C implementation as a dependency, especially when very few of your users will use them. Well OK, now you have a choice between:

- include it anyway, bloat your library for almost everyone, maybe consider supporting different underlying implementations, manage all these dependencies forever, also those libraries have different ways of setting precision, allocating statically or dynamically, etc, so expose that somehow

- don't include it, you're probably now incompatible with all dynamic language implementations that get bignums for free and you should note that up front

This is just one example, but it's pretty representative of Bormann's "have your cake and eat it too" design instincts where he tosses on features and doesn't consider the tradeoffs.

> One example is the tag 24 "Encoded CBOR data item" (Section 3.4.5.1), which indicates that the following byte string is encoded as CBOR. Since this string has the size in bytes, every array or map can be embedded in such tags to ensure the easy skippability.

This only works for types that aren't nested unless you significantly complicate bookkeeping during serialization (store the byte size of every compound object up front), which has the potential to seriously slow down serializing. My approach to that would be to let individual apps do that if they want (encode the size manually), because I don't think it's a common usage.
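To make the bookkeeping cost concrete, here is a minimal C sketch of wrapping an item in tag 24. The names `put_head` and `wrap_tag24` are hypothetical and do not come from any particular CBOR library. The point it illustrates is that the byte-string length prefix precedes the payload, so the inner item has to be fully encoded (or its size computed) before the wrapper can be written; with nesting, that repeats at every level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a CBOR head (major type + argument); returns bytes written.
   `out` must have room for up to 9 bytes. */
static size_t put_head(uint8_t *out, uint8_t major, uint64_t arg)
{
    uint8_t mt = (uint8_t)(major << 5);
    if (arg < 24) {
        out[0] = (uint8_t)(mt | arg);
        return 1;
    }
    int nbytes = arg <= 0xFF ? 1 : arg <= 0xFFFF ? 2 : arg <= 0xFFFFFFFF ? 4 : 8;
    uint8_t info = nbytes == 1 ? 24 : nbytes == 2 ? 25 : nbytes == 4 ? 26 : 27;
    out[0] = (uint8_t)(mt | info);
    for (int i = 0; i < nbytes; i++)
        out[1 + i] = (uint8_t)(arg >> (8 * (nbytes - 1 - i)));
    return 1 + (size_t)nbytes;
}

/* Wrap an ALREADY ENCODED item in tag 24 + byte string.
   Returns total bytes written, or 0 if `out` is too small. */
size_t wrap_tag24(uint8_t *out, size_t out_cap,
                  const uint8_t *inner, size_t inner_len)
{
    assert(out != NULL);
    uint8_t head[2 + 9];
    size_t n = put_head(head, 6, 24);        /* tag 24 encodes as 0xD8 0x18 */
    n += put_head(head + n, 2, inner_len);   /* needs inner_len up front */
    if (out_cap < n || out_cap - n < inner_len)
        return 0;
    memcpy(out, head, n);
    memcpy(out + n, inner, inner_len);
    return n + inner_len;
}
```

A streaming encoder that wants this for every array or map has to either buffer each compound value or make a sizing pass first, which is the slowdown described above.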


> Well OK, now you have a choice between: - include it anyway, [...] - don't include it, [...]

So I guess that's why MP doesn't have a bignum. But MP's inability to store anything more than (u)int64 and float64 does make its data model technically different from JSON, because JSON never properly specified that its number format should be round-trippable in those native types. Even worse, even if you could assume that everything is at most a float64, you would still have to write a considerable amount of subtle code to do the round-trip correctly! [1] At that point your code would already contain some bignum machinery anyway. So why not support bignums then?

[1] Correct floating point formatting and parsing is very difficult and needs a non-trivial amount of precomputed tables and sometimes bignum routines (depends on the exact algorithm)---for the record I'm the main author of Rust's floating point formatting routine. Also for this reason, most language-standard libraries already have hidden support for size-limited bignums!
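As a rough illustration of the round-trip requirement, the C sketch below pushes a `double` through decimal text and back. It assumes the platform's `snprintf` and `strtod` are correctly rounded, which is exactly where the hidden tables and bignum routines live; `roundtrips_via_text` is a hypothetical helper name. Seventeen significant digits are enough to recover any finite IEEE 754 double, though usually not the shortest representation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if x survives double -> decimal text -> double bit-for-bit.
   Assumes a correctly rounded libc; intended for finite values only. */
int roundtrips_via_text(double x)
{
    char buf[64];
    int n = snprintf(buf, sizeof buf, "%.17g", x);
    assert(n > 0 && (size_t)n < sizeof buf);
    double y = strtod(buf, NULL);
    return memcmp(&x, &y, sizeof x) == 0;
}
```

Producing the shortest string that still round-trips is the harder problem the footnote refers to; this sketch sidesteps it by always printing 17 digits.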

> My approach to that would be to let individual apps do that if they want (encode the size manually), because I don't think it's a common usage.

I mean, the supposed processability is already a poorly defined metric, as I wrote earlier. I too suppose that it would be entirely up to the application's (or possibly the library's educated) request.

> But MP's inability to store anything more than (u)int64 and float64 does make its data model technically different from JSON....

Yeah I don't love the MP/JSON comparison the site pushes. I don't really think they solve the same problems, but the reasons are kind of obscure so shrug. MP is quite different from JSON and yeah, numbers are one of those ways.

> [1] Correct floating point formatting and parsing is very difficult and needs a non-trivial amount of precomputed tables and sometimes bignum routines (depends on the exact algorithm)---for the record I'm the main author of Rust's floating point formatting routine. Also for this reason, most language-standard libraries already have hidden support for size-limited bignums!

Oh man yeah tell me about it; I attempted this way back when and gave up lol. I was doing a bunch of research into arbitrary precision libraries and the benchmarks all contain "rendering a big 'ol floating point number" and that's why. Wild.

> I mean, the supposed processability is already a poorly defined metric as I wrote earlier. I too suppose that it would be entirely up to the application's (or possibly library's educated) request

I think in practice implementations are either heavily spec'd (FIDO) on top of a restricted subset of CBOR, or they control both sender and receiver. This is why I think much of the additional protocol discussion in CBOR is pretty moot; if you're taking the CBOR spec's advice on protocols you're not building a good protocol.

> Oh man yeah tell me about it; I attempted this way back when and gave up lol. I was doing a bunch of research into arbitrary precision libraries and the benchmarks all contain "rendering a big 'ol floating point number" and that's why. Wild.

Yes, it's the kind of thing people generally don't even realize exists. To my knowledge only RapidJSON and simdjson have seriously invested in optimizing this aspect---their authors do know this stuff and how difficult it is. Others tend to use a performant but not optimal library like double-conversion (which was the SOTA at the time of its release!).

> Well OK, now you have a choice between: - include it anyway, [...] - don't include it, [...]

I do not see an issue here. In the decoder, one does not need a bignum library; just pass the bignum to the application as a memory blob.

In the application, one knows the semantic restrictions on the given values, and either rejects bignums as semantically invalid (out of range) or needs a bignum processing library anyway.
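A minimal C sketch of that split, under stated assumptions: `bignum_view`, `decode_bignum_view` and `bignum_view_to_u64` are hypothetical names, only definite-length byte strings shorter than 256 bytes are handled, and the wire layout is CBOR's tag 2 / tag 3 (a byte string holding a big-endian magnitude). The decoder hands back a pointer into the input; the application applies its own range rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A view into the input buffer; nothing is copied or allocated. */
typedef struct {
    const uint8_t *bytes;  /* big-endian magnitude n */
    size_t len;
    int negative;          /* 1 for tag 3, where the value is -1 - n */
} bignum_view;

/* Decoder side: returns bytes consumed, or 0 on error/unsupported input. */
size_t decode_bignum_view(const uint8_t *buf, size_t buf_len, bignum_view *out)
{
    assert(buf != NULL && out != NULL);
    if (buf_len < 2 || (buf[0] != 0xC2 && buf[0] != 0xC3))
        return 0;
    size_t pos = 1, len;
    uint8_t head = buf[pos++];
    if ((head >> 5) != 2)               /* tag content must be a byte string */
        return 0;
    uint8_t info = head & 0x1F;
    if (info < 24) {
        len = info;
    } else if (info == 24) {
        if (pos >= buf_len) return 0;
        len = buf[pos++];
    } else {
        return 0;                       /* longer or indefinite: not in this sketch */
    }
    if (buf_len - pos < len)
        return 0;
    out->bytes = buf + pos;
    out->len = len;
    out->negative = (buf[0] == 0xC3);
    return pos + len;
}

/* Application side: accept only non-negative values that fit in uint64_t. */
int bignum_view_to_u64(const bignum_view *v, uint64_t *out)
{
    if (v->negative) return 0;
    size_t i = 0;
    while (i < v->len && v->bytes[i] == 0) i++;   /* skip leading zeros */
    if (v->len - i > 8) return 0;                 /* out of range: reject */
    uint64_t x = 0;
    for (; i < v->len; i++) x = (x << 8) | v->bytes[i];
    *out = x;
    return 1;
}
```

This covers the "reject as out of range" branch without any bignum library; an application that actually needs to compute with larger values still has to bring one.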

Nah it's a pain in the ass if I'm writing a C program to consume your API and I need to pull in MPFR because you used bignums.

A reasonable C API would just give a pointer to decimal digits and a scaling factor. Why did you think MPFR is needed?

You can replace "pull in MPFR" with "work any harder than just using `double`". Bignums are an obvious pain in the ass; I can think of no data representation formats that include support for them, and that's why.

I'm aware of plenty (though I have surveyed at least 20 formats in the past, so that would include more obscure ones). At the very least, you can feed it back to sscanf if you are fine with an ordinary float or double; a thoughtful API would include this as an option too. That's what I expect for the supposed bignum support: round-trippability.

Maybe an example is useful. I want to build a generic CBOR decoder in C. I have 2 options:

- link GMP/mpdecimal/whatever (or hey, provide an abstraction layer and let a user choose)

- accept function pointers to handle bignum tags

Function pointers are an irritation (I know this because my MP library uses them): they're slower than direct calls, you've gotta check for NULL a lot, and you're also asking any application that uses your library and wants bignum support to include GMP itself (with all the attendant maintenance, setup, etc.).

Or, you can include it yourself, but welcome to doing all the maintenance yourself, and exposing all of GMP's knobs (e.g. [0]).
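For reference, the function-pointer option might look roughly like the C sketch below. Everything here (`bignum_handler`, `decoder_hooks`, `dispatch_bignum`, the `DEC_*` codes, `record_len_handler`) is a hypothetical shape for illustration, not the API of any existing MP or CBOR library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hook called with the raw big-endian magnitude; returns 0 on success. */
typedef int (*bignum_handler)(void *user, int negative,
                              const uint8_t *mag, size_t len);

typedef struct {
    bignum_handler on_bignum;  /* NULL means "no bignum support wired in" */
    void *user;                /* opaque pointer passed back to the hook */
} decoder_hooks;

enum { DEC_OK = 0, DEC_UNSUPPORTED = -1, DEC_HANDLER_FAILED = -2 };

/* Called by the decoder whenever it meets a bignum tag. */
int dispatch_bignum(const decoder_hooks *hooks, int negative,
                    const uint8_t *mag, size_t len)
{
    assert(mag != NULL || len == 0);
    if (hooks == NULL || hooks->on_bignum == NULL)
        return DEC_UNSUPPORTED;   /* the NULL check needed at every call site */
    return hooks->on_bignum(hooks->user, negative, mag, len) == 0
               ? DEC_OK
               : DEC_HANDLER_FAILED;
}

/* Example hook: only records the magnitude length in *user. A real one
   would hand `mag` to whatever bignum library the application chose. */
int record_len_handler(void *user, int negative,
                       const uint8_t *mag, size_t len)
{
    (void)negative;
    (void)mag;
    *(size_t *)user = len;
    return 0;
}
```

The library stays free of a bignum dependency, but the indirect call, the NULL checks, and the burden on the application to supply and maintain the real handler are all still there.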

You might argue that these aren't the only options, but a deserialized value has to be understood by the application; your suggestions aren't good tradeoffs. sscanf (also do not use sscanf) doesn't work if the value is actually a bignum, and yielding a bespoke bignum format is just as unusable as simply returning whatever's encoded in CBOR. How would I add two such values together? How would I display it? This is what bignum libraries are for.

All this is made far worse by the fact that there are effectively no public CBOR (or MP) APIs where you're expecting them to be consumed entirely by generic decoders, so there's not even a need to force generic decoders to go through all this effort to support bignums (etc.) Further, unlike MP, CBOR doesn't let you use tags for application-specific purposes. Put it all together and it's uniformly worse: implementations are either more complex or have surprising holes, you can't count on generic decoders supporting tags when building an API or defining messages, and you can't even just say, "for this protocol, tag 31 is a UUID".

This is probably a big reason (though I can think of others) why the only formats you can think of w/ bignum support are obscure.

> That's what I expect for the supposed bignum support: round-trippability.

Round-tripping is only meaningful if a receiver can use the values before reserializing, otherwise memcpy meets your requirements. If a sender gives me a serialized bignum, the deserializing library has to deserialize it into a value I can understand and use; that's the whole point of a deserialization library.

MP's support for timestamps is a reasonable example here: it decomposes into a time_t, and it can do this because it defines the max size. You can't do that w/ a bignum--the whole point of a bignum is it's big beyond defining. A CBOR sender can send you an infinite series of digits, and the spec doesn't reckon with this at all.
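A short C sketch of why the bounded case is easy: the MessagePack timestamp extension (ext type -1) has exactly three fixed-size layouts, so it always decomposes into a seconds count and a nanoseconds field. The names `mp_timestamp`, `ts_read_be` and `mp_decode_timestamp` are hypothetical, and the conversion of the 96-bit format's seconds to `int64_t` assumes a two's complement platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int64_t sec; uint32_t nsec; } mp_timestamp;

/* Read an n-byte big-endian unsigned integer (n <= 8). */
static uint64_t ts_read_be(const uint8_t *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++) v = (v << 8) | p[i];
    return v;
}

/* Returns bytes consumed, or 0 if buf does not hold a valid timestamp. */
size_t mp_decode_timestamp(const uint8_t *buf, size_t len, mp_timestamp *out)
{
    assert(buf != NULL && out != NULL);
    uint64_t sec, nsec = 0;
    size_t used;
    if (len >= 6 && buf[0] == 0xD6 && buf[1] == 0xFF) {
        /* timestamp 32: fixext 4, 32-bit unsigned seconds */
        sec = ts_read_be(buf + 2, 4);
        used = 6;
    } else if (len >= 10 && buf[0] == 0xD7 && buf[1] == 0xFF) {
        /* timestamp 64: fixext 8, 30-bit nanoseconds + 34-bit seconds */
        uint64_t v = ts_read_be(buf + 2, 8);
        nsec = v >> 34;
        sec = v & 0x3FFFFFFFFULL;
        used = 10;
    } else if (len >= 15 && buf[0] == 0xC7 && buf[1] == 12 && buf[2] == 0xFF) {
        /* timestamp 96: ext 8, 32-bit nanoseconds + 64-bit signed seconds */
        nsec = ts_read_be(buf + 3, 4);
        sec = ts_read_be(buf + 7, 8);
        used = 15;
    } else {
        return 0;
    }
    if (nsec > 999999999u)
        return 0;                 /* nanoseconds must stay below one second */
    out->sec = (int64_t)sec;
    out->nsec = (uint32_t)nsec;
    return used;
}
```

The largest payload is 12 bytes, so the decoder never allocates and the caller always gets a fixed-size struct. There is no equivalent upper bound to lean on for a bignum.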

[0]: https://gmplib.org/manual/Memory-Management