by jcrites 3715 days ago
What do you mean by they both have self-describing schemas? In order to read or write Avro data, an application needs to possess a schema for that data -- the specific schema that the data was written with, and (when writing) the same schema that a later reader expects to find. This means the data is not self-describing.

Ion is designed to be self-describing, meaning that no schema is necessary to deserialize and interact with Ion structures. It's consequently possible to interact with Ion in a dynamic and reflective way, in the same way that you can with JSON and XML. For example, it's possible to write a pretty-printer for a binary Ion structure coming off the wire without having any idea of, or schema for, what's inside. Ion's advantage over those formats is that it's strongly typed (or richly typed, if you prefer). For example, Ion has types for timestamps and for arbitrary-precision decimals (useful for currency), and it can embed binary data directly (without base64 encoding).
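
To make the "self-describing" idea concrete, here is a minimal Python sketch of a toy tagged format. It is emphatically not Ion's actual binary encoding: the tag numbers, the length prefix, and the `encode`/`decode` names are all made up for illustration. The only point is that each value carries its own type tag, so a generic reader can rebuild timestamps, decimals, and raw bytes without any schema.

```python
import struct
from datetime import datetime, timezone
from decimal import Decimal

# Hypothetical one-byte type tags; these are NOT Ion's real type codes.
T_INT, T_DECIMAL, T_TIMESTAMP, T_BLOB, T_STRING, T_STRUCT = range(6)

def _chunk(tag, payload):
    # Every value is: tag byte + 4-byte big-endian length + payload.
    return bytes([tag]) + struct.pack(">I", len(payload)) + payload

def encode(value):
    if isinstance(value, datetime):
        return _chunk(T_TIMESTAMP, value.isoformat().encode())
    if isinstance(value, Decimal):
        return _chunk(T_DECIMAL, str(value).encode())
    if isinstance(value, bytes):
        return _chunk(T_BLOB, value)  # raw bytes, no base64 needed
    if isinstance(value, str):
        return _chunk(T_STRING, value.encode())
    if isinstance(value, int):
        return _chunk(T_INT, str(value).encode())
    if isinstance(value, dict):
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return _chunk(T_STRUCT, body)
    raise TypeError(f"unsupported type: {type(value).__name__}")

def decode(buf, pos=0):
    """Decode one value starting at pos; return (value, next_pos)."""
    tag = buf[pos]
    (length,) = struct.unpack_from(">I", buf, pos + 1)
    start = pos + 5
    end = start + length
    payload = buf[start:end]
    if tag == T_INT:
        return int(payload.decode()), end
    if tag == T_DECIMAL:
        return Decimal(payload.decode()), end
    if tag == T_TIMESTAMP:
        return datetime.fromisoformat(payload.decode()), end
    if tag == T_BLOB:
        return bytes(payload), end
    if tag == T_STRING:
        return payload.decode(), end
    if tag == T_STRUCT:
        fields, p = {}, start
        while p < end:
            key, p = decode(buf, p)
            val, p = decode(buf, p)
            fields[key] = val
        return fields, end
    raise ValueError(f"unknown tag {tag}")

doc = {
    "id": 7,
    "price": Decimal("19.99"),
    "when": datetime(2016, 4, 22, tzinfo=timezone.utc),
    "raw": b"\x00\xff",
    "nested": {"name": "ion"},
}
decoded, _ = decode(encode(doc))  # no schema passed anywhere
```

The cost is also visible here: every field name and type tag is written again for every value, which is the redundancy a schema-based format avoids.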

I wouldn't try to say that one or the other is better across the board. Rather, they have tradeoffs and relative strengths in different circumstances. Ion is in part designed to tackle scenarios where your data might live a really long time and needs to be comprehensible decades from now (whether you kept track of the schema or not, or remember which one it was), and where it needs to be comprehensible in a large distributed environment in which not every application possesses the latest schema, or in which coordinating a single compile-time schema is a challenge (maybe each app only cares about some part of the data). Ion is well-suited to long-lived, document-type data that's stored at rest and interacted with in a variety of potentially complex ways over time. In the case of a simple RPC relationship between a single client and service, where the data being exchanged is ephemeral and won't stick around, and it's easy to definitively coordinate a schema across both applications, a typical serialization framework is a fine choice.


I think it depends on what level you're referring to. If you mean record-level, then I concede that it's not self-describing. However, looking at the suggested use cases, it seems that it's "self-describing" in that you'll always be able to decode data stored according to what the documentation recommends:

"Avro data is always serialized with its schema. Files that store Avro data should always also include the schema for that data in the same file. Avro-based remote procedure call (RPC) systems must also guarantee that remote recipients of data have a copy of the schema used to write that data."

https://avro.apache.org/docs/current/spec.html#Data+Serializ...

That's interesting. I didn't know that about Avro. Does the framework take responsibility for including the schema and defining a format consisting of schema plus data, or is that the responsibility of the application layer? It sounds like that might just be a convention or best practice recommended in the documentation, rather than a technical property of Avro itself.

If it's the application's responsibility to bundle the schema in Avro, then one difference is that Ion takes responsibility for embedding schema information along with each structure and field. Ion is also capable of representing data where there is no schema (analogy: a complex document like an HTML5 page), or working efficiently with large structures without deserializing everything even if the application needs data in just one field.

Another platform in contrast with Ion is Apache Parquet [1]. Parquet's support for columnar data means that it can serialize and compress table-like data extremely efficiently (it serializes all values in one column, followed by the next, until the end of a chunk -- enabling efficient compression as well as efficient column scans). Ion by comparison would serialize each row and field within it in a self-describing way (even though that information is redundant, in this particular case, since all rows are the same). Great flexibility and high fidelity at the expense of efficiency.
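
A rough way to see the row-versus-column tradeoff is the toy Python comparison below. It uses JSON as a stand-in for both layouts, so it shows neither Parquet's nor Ion's real encoding; the function names and the sample data are assumptions for illustration. The row layout repeats every field name in every row (self-describing), while the column layout writes each name once followed by all of that column's values.

```python
import json
import zlib

# Made-up table-like data: every row has the same three fields.
rows = [
    {"id": i, "region": "us-east" if i % 2 else "eu-west", "price": i * 0.5}
    for i in range(1000)
]

def to_row_layout(rows):
    # Each row carries its own field names, like a self-describing record.
    return json.dumps(rows).encode()

def to_column_layout(rows):
    # Field names appear once; values of one column are stored together.
    columns = {name: [r[name] for r in rows] for name in rows[0]}
    return json.dumps(columns).encode()

def from_column_layout(data):
    columns = json.loads(data)
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]

row_bytes = to_row_layout(rows)
col_bytes = to_column_layout(rows)
print("row layout:   ", len(row_bytes), "raw,", len(zlib.compress(row_bytes)), "compressed")
print("column layout:", len(col_bytes), "raw,", len(zlib.compress(col_bytes)), "compressed")
```

The column layout only works because all rows share the same fields. The row layout would keep working if each row had a different shape, which is the flexibility being paid for.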

[1] https://parquet.apache.org/documentation/latest/

Avro files have a header containing metadata, including the schema as well as things like the compression codec (deflate and snappy are supported), and all of the implementations that I have used (mostly the Java and Python bindings) just do this in the background.

Another fun thing is that Avro supports union types, so to make a field nullable you just declare its type as a union such as ["null", "double"] or whatever.
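
For anyone curious what that looks like on the wire, here is a small Python sketch of how I understand the Avro binary encoding of a union: a zigzag varint holding the index of the chosen branch, followed by the value encoded according to that branch (nothing at all for null, 8 little-endian bytes for a double). The helper names are my own, and the decoder is simplified to assume a single-byte branch index.

```python
import struct

# The union schema as it would appear in an Avro schema's JSON.
NULLABLE_DOUBLE = ["null", "double"]

def zigzag_varint(n):
    """Encode an integer the way Avro encodes int/long values."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_nullable_double(value):
    if value is None:
        # Branch 0 ("null"); null itself is written as zero bytes.
        return zigzag_varint(NULLABLE_DOUBLE.index("null"))
    # Branch 1 ("double"), then the IEEE 754 value, little-endian.
    return zigzag_varint(NULLABLE_DOUBLE.index("double")) + struct.pack("<d", value)

def decode_nullable_double(data):
    # Simplification: a two-branch union always fits the index in one byte.
    branch = data[0] >> 1
    if NULLABLE_DOUBLE[branch] == "null":
        return None
    return struct.unpack("<d", data[1:9])[0]
```

So a null costs one byte and a present double costs nine, with no per-record field names or type names.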

But one of the best things about Avro (and Parquet, for that matter) is that it is well supported by the Hadoop ecosystem.

In the spec[1] there is a definition of an "object container file" which includes the schema, and is the default format used whenever you save an Avro file. You can even use it whenever sending Avro data through the wire, if you don't mind paying the extra space cost.

[1]: http://avro.apache.org/docs/1.7.7/spec.html
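
As a sketch of what that container header holds, the Python below builds and parses one by hand, following my reading of the spec: the 4-byte magic `Obj\x01`, a metadata map of string keys to byte values (with `avro.schema` and optionally `avro.codec`), and a 16-byte sync marker. The function names are hypothetical, and in practice a real Avro library does all of this for you.

```python
import io
import json
import os

MAGIC = b"Obj\x01"

def write_long(n):
    # Avro long: zigzag, then base-128 varint.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def read_long(stream):
    shift = result = 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zigzag

def read_bytes(stream):
    return stream.read(read_long(stream))

def build_header(schema, codec="null"):
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": codec.encode()}
    out = bytearray(MAGIC)
    out += write_long(len(meta))  # one map block holding all entries
    for key, value in meta.items():
        kb = key.encode()
        out += write_long(len(kb)) + kb + write_long(len(value)) + value
    out += write_long(0)          # zero count ends the map
    out += os.urandom(16)         # sync marker, random per file
    return bytes(out)

def read_header(stream):
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:             # negative count is followed by the block's byte size
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(16)
    codec = meta.get("avro.codec", b"null").decode()
    return json.loads(meta["avro.schema"]), codec, sync

# Made-up example schema.
schema = {"type": "record", "name": "Quote",
          "fields": [{"name": "price", "type": ["null", "double"]}]}
header = build_header(schema, codec="deflate")
stored_schema, stored_codec, sync_marker = read_header(io.BytesIO(header))
```

The schema is paid for once per file rather than once per record, which is why the extra space only hurts when you send many small messages this way.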

I think libraries generally take care of stuffing the schema into the wire protocol, and I have a hunch you're right in that it's implementation-defined.

I like that in this regard, any individual record in Ion is standalone. I can think of a few ways that could come in handy, e.g., a data packet of nested mixed-version records. Did not know about Parquet, thanks!

There are some use cases where record-level self-description is very useful, for example when dealing with small records in a database, NoSQL store, or message queue that could be written by multiple versions of an application. To cover that case well with Avro, where records are not self-describing, you really need something like a schema registry and a schema ID embedded with each record (e.g. http://www.confluent.io/blog/schema-registry-kafka-stream-pr... ).
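
The registry idea can be sketched in a few lines of Python. Everything here is hypothetical: an in-memory `SchemaRegistry` class, a 4-byte big-endian ID prefix, and JSON payloads standing in for Avro-encoded bytes. It is not the wire format of any particular registry product, just the general shape of "small ID per record, full schema looked up elsewhere".

```python
import json
import struct

class SchemaRegistry:
    """Toy in-memory registry mapping schema IDs to schemas."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}

    def register(self, schema):
        key = json.dumps(schema, sort_keys=True)  # canonical form for lookup
        if key not in self._ids:
            new_id = len(self._by_id) + 1
            self._ids[key] = new_id
            self._by_id[new_id] = schema
        return self._ids[key]

    def lookup(self, schema_id):
        return self._by_id[schema_id]

def frame(schema_id, payload):
    # Prefix each record with the ID of the schema it was written with.
    return struct.pack(">I", schema_id) + payload

def unframe(message, registry):
    (schema_id,) = struct.unpack_from(">I", message)
    return registry.lookup(schema_id), message[4:]

registry = SchemaRegistry()
v1 = {"type": "record", "name": "User", "fields": [{"name": "name", "type": "string"}]}
v2 = {"type": "record", "name": "User",
      "fields": [{"name": "name", "type": "string"},
                 {"name": "email", "type": ["null", "string"], "default": None}]}

# Two application versions write to the same queue.
queue = [
    frame(registry.register(v1), json.dumps({"name": "ada"}).encode()),
    frame(registry.register(v2), json.dumps({"name": "bob", "email": None}).encode()),
]
writer_schemas = [unframe(message, registry)[0] for message in queue]
```

Each record pays four bytes instead of a full schema, at the price of every reader needing access to the registry.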

The intent of the stored schema isn't really self-description. A typical use case for Avro is data storage over long periods of time, and the schema is expected to evolve at some point during that time. You therefore still need to specify a target schema to read the data into, and it is allowed to differ from the stored schema. Avro then uses the stored schema to map the stored data into the target schema. Most Avro libraries expect you to get the target schema from a separate source before reading data.
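
A simplified Python sketch of that mapping step for record fields follows. It works on already-decoded dicts and only covers the basic rules as I understand them: fields are matched by name, fields only in the stored (writer) schema are dropped, and fields only in the target (reader) schema take their default. The `resolve` function is a made-up name, and real Avro resolution also handles aliases, type promotion, unions, and nested types.

```python
def resolve(record, writer_schema, reader_schema):
    """Map a record written with writer_schema into reader_schema's shape."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            out[name] = record[name]          # present in both: copy across
        elif "default" in field:
            out[name] = field["default"]      # reader-only field: use its default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out                                # writer-only fields are ignored

# Made-up stored (writer) and target (reader) schemas.
stored = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"},
                     {"name": "legacy_flag", "type": "boolean"}]}
target = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"},
                     {"name": "email", "type": ["null", "string"], "default": None}]}

resolved = resolve({"name": "ada", "legacy_flag": True}, stored, target)
```

Note that both schemas are needed at read time: the stored one to know what was written, and the target one to know what the application wants back.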