by Matthias247 3800 days ago
Afaik newer versions of messagepack added an extra type to have string and binary now separated.

I read somewhere that CBOR was better designed for extensibility, but don't know anything further about it.

One difference (on the non-technical side) is that CBOR is standardized through IETF.


The base types are "unsigned int", "negative int", "binary data", "UTF-8 string", an array of said items, a map (key, value) of said items, extended types (up to 2^64) and tags (again, up to 2^64). There are only 8 "extended types" currently defined: false, true, null, undefined, half float (IEEE 16-bit), single float (IEEE 32-bit), double float (IEEE 64-bit), and break (used to terminate streaming data). That leaves plenty of unused values for future expansion.
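To make the list above concrete, here is a minimal sketch of how the first byte of a CBOR item splits into a major type (top 3 bits) and additional info (low 5 bits), as I understand RFC 7049. The names `MAJOR_TYPES` and `split_initial_byte` are made up for illustration, and this only inspects one byte rather than decoding a whole item.

```python
# Major types in the order RFC 7049 numbers them (0 through 7).
# The labels mirror the wording used in the comment above.
MAJOR_TYPES = [
    "unsigned int",   # 0
    "negative int",   # 1
    "binary data",    # 2
    "UTF-8 string",   # 3
    "array",          # 4
    "map",            # 5
    "tag",            # 6
    "extended type",  # 7 (simple values, floats, break)
]


def split_initial_byte(b: int):
    """Return (major type label, additional info) for one initial byte."""
    major = b >> 5    # top 3 bits select the major type
    info = b & 0x1F   # low 5 bits: small value, length, or size marker
    return MAJOR_TYPES[major], info


for byte in (0x18, 0x63, 0xF5, 0xFF):
    print(hex(byte), split_initial_byte(byte))
```

In the last two bytes, additional info 21 is `true` and 31 is `break`, both under major type 7.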

Tags are used to attach meta information to a data item. For example, you can tag a UTF-8 string as a URL, or tag an array with a reference so it can be referred to elsewhere in the CBOR-encoded data (an extension defined by the IETF but outside RFC 7049).
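As a rough illustration of tagging, the sketch below hand-encodes a UTF-8 string and wraps it in tag 32, which RFC 7049 assigns to URIs. The helper names (`encode_head`, `encode_text`, `encode_tagged`) are hypothetical, not from any library, and only the definite-length forms are handled.

```python
import struct


def encode_head(major: int, n: int) -> bytes:
    """Encode the initial byte plus its argument n (0 <= n < 2**64)."""
    if n < 24:
        return bytes([(major << 5) | n])        # value fits in the initial byte
    if n < 2**8:
        return bytes([(major << 5) | 24, n])    # 1-byte argument follows
    if n < 2**16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", n)
    if n < 2**32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", n)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", n)


def encode_text(s: str) -> bytes:
    """Major type 3: length in bytes, then the UTF-8 payload."""
    data = s.encode("utf-8")
    return encode_head(3, len(data)) + data


def encode_tagged(tag: int, item: bytes) -> bytes:
    """Major type 6: the tag number, followed by the already-encoded item."""
    return encode_head(6, tag) + item


TAG_URI = 32  # URI tag from RFC 7049
encoded = encode_tagged(TAG_URI, encode_text("http://www.example.com"))
print(encoded.hex())
```

The tag costs two extra bytes here (`d8 20`); a decoder that does not care about the tag can skip it and still read the string.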

> Afaik newer versions of messagepack added an extra type to have string and binary now separated.

The problem is that the 'str' type contains arbitrary binary data in an unspecified encoding, and always will, because of backward compatibility. This isn't changed by adding a 'bin' type.

Msgpack decoders in Python, for example, have to give you bytestrings unless you pass an option that promises that 'str's are all encoded in UTF-8.
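A small sketch of why that option exists, using only the standard library rather than a real msgpack package. `decode_fixstr` and its `assume_utf8` flag are made-up names for illustration; actual libraries spell the option differently. It handles only the short "fixstr" form (first byte `0xa0` to `0xbf`, length in the low 5 bits).

```python
def decode_fixstr(buf: bytes, assume_utf8: bool = False):
    """Decode one short msgpack 'str' item from the start of buf.

    Returns bytes by default. Returns str only when the caller promises
    that the payload is UTF-8 (assume_utf8=True).
    """
    first = buf[0]
    if first & 0xE0 != 0xA0:
        raise ValueError("not a fixstr")
    length = first & 0x1F
    payload = buf[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated input")
    return payload.decode("utf-8") if assume_utf8 else payload


text_item = b"\xa2hi"          # 'str' holding valid UTF-8
legacy_item = b"\xa2\xff\xfe"  # 'str' holding bytes that are not UTF-8

print(decode_fixstr(text_item))                    # bytes, always safe
print(decode_fixstr(text_item, assume_utf8=True))  # text, if the promise holds
print(decode_fixstr(legacy_item))                  # still fine as bytes
```

Calling `decode_fixstr(legacy_item, assume_utf8=True)` raises `UnicodeDecodeError`, which is the failure a decoder risks if it returns text by default.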

From https://github.com/msgpack/msgpack/blob/master/spec.md

  Raw
    String extending Raw type represents a UTF-8 string
    Binary extending Raw type represents a byte array

Ah okay, I didn't know there was now a specific String type (and that the one I was calling 'str' is called 'raw'). Does the Python library use it?

I don't even know what to believe anymore. That documentation is referring to two types, with "raw" renamed to "str" plus a new "bin", which is what I thought it was.

But the link you posted referred to three types, where "str" and "bin" subclass "raw", which sounded like it provided a non-backward-compatible "str" that's guaranteed to be text.
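For reference, here is how the two families look on the wire, going by the format table in the linked spec as I read it. The function name `raw_family` is made up for illustration. As far as I can tell, the fixstr, str 16 and str 32 bytes are the same ones older versions of the spec used for "raw", while str 8 and the three bin bytes are newer additions.

```python
def raw_family(first_byte: int) -> str:
    """Classify a msgpack format byte as 'str', 'bin', or 'other'."""
    # fixstr (0xa0-0xbf), str 8 (0xd9), str 16 (0xda), str 32 (0xdb)
    if 0xA0 <= first_byte <= 0xBF or first_byte in (0xD9, 0xDA, 0xDB):
        return "str"
    # bin 8 (0xc4), bin 16 (0xc5), bin 32 (0xc6)
    if first_byte in (0xC4, 0xC5, 0xC6):
        return "bin"
    return "other"


for b in (0xA5, 0xD9, 0xDA, 0xC4, 0xC0):
    print(hex(b), raw_family(b))
```

If that reading is right, data written by an old encoder as "raw" is indistinguishable from a new "str", which is the backward-compatibility problem mentioned earlier in the thread.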

They should just add a UTF-8 type... I don't know why that wasn't the default for strings all along.