by escherize 3715 days ago
Is there a source of benchmarks/reviews for the various ways to represent data? As I see it, there are a lot of formats I'd like to hear pros/cons for: JSON, EDN + Transit (my fave), YAML, Google protobufs, Thrift (?), and Ion.

And where does Ion fit here?

3 comments

MessagePack is quite fast and the newest version has binary fields, but it lacks rich datatypes like the decimals and timestamps mentioned by another commenter. If Ion is as fast and has adequate language support, it sounds like it would be a good first choice for a new project.

Edit: There is a benchmark script that tests a few serializers and validators in Ruby in my [employer's] ClassyHash gem: https://github.com/deseretbook/classy_hash/. It would be easy to add more serializers to the benchmark: https://github.com/deseretbook/classy_hash/blob/master/bench...

Ion's advantage is that it's both strongly typed, with a rich type system, and self-describing.

Data formats like JSON and XML can be somewhat self-describing, but not always completely. Both tend to need to embed more complex data types either as strings with implied formats or as nested structures. (Consider: How would you represent a timestamp in JSON such that an application could unambiguously read it? An arbitrary-precision decimal? A byte array?) I'm not familiar with EDN, but it appears to be in a similar position to JSON in this regard. Protocol Buffers, Thrift, and Avro require a schema to be defined in advance, and only work with schema-described data as serialization layers. Ion is designed to work with self-describing data that might be fairly complex and has no compiled-ahead-of-time schema.
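To make the timestamp/decimal/byte-array question concrete, here is a minimal Python sketch using only the standard library. The record, its field names, and the `encode_by_convention` helper are made up for illustration. It shows that each rich value has to be flattened to a string by some convention, and that a reader who doesn't know the convention gets back three plain strings.

```python
import base64
import json
from datetime import datetime, timezone
from decimal import Decimal

# Hypothetical record holding the three types asked about above.
record = {
    "created": datetime(2016, 4, 22, 12, 0, tzinfo=timezone.utc),
    "price": Decimal("19.990"),
    "thumbnail": b"\x00\x01\xff",
}


def encode_by_convention(value):
    """Called by json.dumps for values it cannot serialize natively."""
    if isinstance(value, datetime):
        return value.isoformat()  # convention: ISO 8601 string
    if isinstance(value, Decimal):
        return str(value)  # convention: decimal digits as a string
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")  # convention: base64
    raise TypeError(f"unsupported type: {type(value).__name__}")


wire = json.dumps(record, default=encode_by_convention)
decoded = json.loads(wire)

# Nothing on the wire says which string was a timestamp, a decimal, or bytes.
for key, value in decoded.items():
    print(key, type(value).__name__, repr(value))
```

Without the `default=` hook, `json.dumps(record)` raises `TypeError`, so the convention has to live in code on both ends rather than in the data itself.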

Ion makes it easy to pass data around with high fidelity even if intermediate systems through which the data passes understand only part of the data but not all of it. A classic weakness of traditional RPC systems is that, during an upgrade where an existing structure gains an additional field, that structure might pass through an application that doesn't know about the field yet. Thus when the structure gets deserialized and serialized again, the field is missing. The Ion structure by comparison can be passed from the wire to the application and back without that kind of loss. (Some serialization-based frameworks have solutions to this problem too.)
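The upgrade problem can be sketched in a few lines of Python. This uses JSON and plain dicts as a stand-in, not Ion or any real RPC framework, and `OrderV1`, `gift_wrap`, and both relay functions are hypothetical names. An intermediary that binds the payload to its own older type drops the new field, while one that edits the generic structure in place carries it through.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OrderV1:
    """The intermediary's (older) view of the structure."""
    order_id: str
    quantity: int


def relay_via_typed_object(payload: str) -> str:
    # Deserialize into the known type, modify, serialize again.
    raw = json.loads(payload)
    order = OrderV1(order_id=raw["order_id"], quantity=raw["quantity"])
    order.quantity += 1
    return json.dumps(asdict(order))  # fields unknown to OrderV1 are gone


def relay_via_generic_structure(payload: str) -> str:
    # Touch only the field we understand; leave everything else alone.
    raw = json.loads(payload)
    raw["quantity"] += 1
    return json.dumps(raw)


# A newer producer has started sending an extra field.
upgraded = json.dumps({"order_id": "A-1", "quantity": 2, "gift_wrap": True})

print(relay_via_typed_object(upgraded))       # gift_wrap is missing
print(relay_via_generic_structure(upgraded))  # gift_wrap survives
```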

One downside is that its performance tends to be worse than schema-based serialization frameworks like Thrift/ProtoBuf/Avro where the payload is generally known in advance, and code can be generated that will read and deserialize it. Another downside is that it's difficult to isolate Ion-aware code from the more general purpose "business logic" in an application, due to the absence of a serialization layer producing/consuming POJOs; instead it's common to read an Ion structure from the wire and access it directly from application logic.

EDN supports dates, etc, too.

However, it doesn't support blobs. I'm conflicted about this point. On one hand, small blobs can occasionally be useful to send within a larger payload. On the other hand, small blobs almost always become large blobs, and so I'd rather plan for out-of-band (preferably even content addressable) representations of blobs.
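For what it's worth, here is roughly what I mean by out-of-band, content-addressable blobs, as a small Python sketch. The in-memory `blob_store` dict stands in for whatever object store you'd really use, and the `sha256:` reference format is just an assumption for illustration.

```python
import hashlib
import json

# Hypothetical stand-in for an external object store.
blob_store = {}


def put_blob(data: bytes) -> str:
    """Store the bytes under their SHA-256 digest and return a reference."""
    digest = hashlib.sha256(data).hexdigest()
    blob_store[digest] = data
    return f"sha256:{digest}"


def get_blob(ref: str) -> bytes:
    """Fetch the bytes for a reference and verify they match the digest."""
    algorithm, digest = ref.split(":", 1)
    data = blob_store[digest]
    if hashlib.new(algorithm, data).hexdigest() != digest:
        raise ValueError("blob does not match its reference")
    return data


# The payload carries only the small reference, never the bytes themselves.
payload = json.dumps({"title": "cover art", "image": put_blob(b"\x89PNG...")})
image_bytes = get_blob(json.loads(payload)["image"])
```

The payload stays the same size however large the blob grows, and identical blobs collapse to the same reference.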

> Another downside is that it's difficult to isolate Ion-aware code from the more general purpose "business logic" in an application, due to the absence of a serialization layer producing/consuming POJOs; instead it's common to read an Ion structure from the wire and access it directly from application logic.

This is indeed a common pitfall, especially since traversing Ion is slow and expensive. I've squeezed out up to a 30% performance gain by converting Ion data to POJOs up front and just using those.
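The shape of that conversion, sketched in Python rather than Java, with a plain dict standing in for the parsed wire structure (this is not the Ion API; `LineItem`, `line_item_from_wire`, and the field names are hypothetical):

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    """Plain object used by business logic; knows nothing about the wire format."""
    sku: str
    quantity: int
    unit_price: Decimal


def line_item_from_wire(node: dict) -> LineItem:
    # The only place that walks the wire structure: one pass, at the boundary.
    return LineItem(
        sku=str(node["sku"]),
        quantity=int(node["quantity"]),
        unit_price=Decimal(str(node["unit_price"])),
    )


def order_total(items: list) -> Decimal:
    # Business logic works on plain attributes only.
    return sum((item.quantity * item.unit_price for item in items), Decimal("0"))


wire_items = [
    {"sku": "book-1", "quantity": 2, "unit_price": "1.50"},
    {"sku": "pen-7", "quantity": 1, "unit_price": "0.25"},
]
items = [line_item_from_wire(node) for node in wire_items]
total = order_total(items)
```

The trade-off is the one raised upthread: anything the plain object doesn't model gets dropped at the boundary.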

For the JVM, the most popular benchmark is https://github.com/eishay/jvm-serializers/wiki