by wyc 3715 days ago
I think it depends on what level you're referring to. If you mean record-level, then I concede that it's not self-describing. However, looking at the suggested use cases, it seems that it's "self-describing" in that you'll always be able to decode data stored according to what the documentation recommends:

"Avro data is always serialized with its schema. Files that store Avro data should always also include the schema for that data in the same file. Avro-based remote procedure call (RPC) systems must also guarantee that remote recipients of data have a copy of the schema used to write that data."

https://avro.apache.org/docs/current/spec.html#Data+Serializ...

3 comments

That's interesting. I didn't know that about Avro. Does the framework take responsibility for including the schema and defining a format consisting of schema plus data, or is that the responsibility of the application layer? It sounds like that might just be a convention or best practice recommended in the documentation, rather than a technical property of Avro itself.

If it's the application's responsibility to bundle the schema in Avro, then one difference is that Ion takes responsibility for embedding schema information along with each structure and field. Ion is also capable of representing data where there is no schema (analogy: a complex document like an HTML5 page), or working efficiently with large structures without deserializing everything even if the application needs data in just one field.

Another platform in contrast with Ion is Apache Parquet [1]. Parquet's support for columnar data means that it can serialize and compress table-like data extremely efficiently (it serializes all values in one column, followed by the next, until the end of a chunk -- enabling efficient compression as well as efficient column scans). Ion by comparison would serialize each row and field within it in a self-describing way (even though that information is redundant, in this particular case, since all rows are the same). Great flexibility and high fidelity at the expense of efficiency.

[1] https://parquet.apache.org/documentation/latest/
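
A toy sketch of the row-versus-column contrast described above, in plain Python. It uses neither Ion nor Parquet; the `rows` data and the helper names are invented for illustration, and JSON stands in for "self-describing row".

```python
import json

# Hypothetical table-like data: every row has the same fields.
rows = [
    {"id": 1, "city": "Oslo", "temp": 3.5},
    {"id": 2, "city": "Oslo", "temp": 4.0},
    {"id": 3, "city": "Rome", "temp": 15.2},
]

def row_layout(rows):
    """Self-describing rows: the field names are repeated in every record."""
    return [json.dumps(r) for r in rows]

def column_layout(rows):
    """Columnar chunk: field names stored once, values grouped per column."""
    names = list(rows[0])
    return {name: [r[name] for r in rows] for name in names}

def scan_column(columns, name):
    """A column scan touches one list instead of every record."""
    return columns[name]

row_encoded = row_layout(rows)
columns = column_layout(rows)
```

In the row layout each record can be decoded on its own, at the cost of repeating the field names; in the column layout similar values sit next to each other, which is what makes compression and column scans cheap.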

Avro files have a header that carries metadata, including the schema and things like the compression codec (deflate and snappy are supported). All of the implementations I have used (mostly the Java and Python bindings) just do this in the background.

Another fun thing is that Avro supports union types, so to make something nullable you just use a union like ["null", "double"] or whatever.
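
A hand-rolled sketch of how a nullable double could be written in Avro's binary encoding, assuming the layout in the spec: the zero-based index of the union branch as a zig-zag varint, followed by the value for that branch. The function names are made up; a real Avro library does this for you.

```python
import json
import struct

# A nullable double, written as a union in Avro's JSON schema language.
schema = json.loads('["null", "double"]')

def zigzag_varint(n: int) -> bytes:
    """Encode an integer the way Avro does: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_nullable_double(value) -> bytes:
    """Write the union branch index, then the value for that branch."""
    if value is None:
        # null is encoded as zero bytes, so only the branch index is written
        return zigzag_varint(schema.index("null"))
    # double is 8 bytes, little-endian IEEE 754
    return zigzag_varint(schema.index("double")) + struct.pack("<d", value)
```

So a null costs one byte and a present double costs nine: one for the branch index and eight for the value.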

But one of the best things about Avro (and Parquet, for that matter) is that it is well supported by the Hadoop ecosystem.

In the spec [1] there is a definition of an "object container file", which includes the schema and is the default format used whenever you save an Avro file. You can even use it when sending Avro data over the wire, if you don't mind paying the extra space cost.

[1]: http://avro.apache.org/docs/1.7.7/spec.html
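
A minimal sketch of the object container file header, assuming the layout given in the spec: the four bytes `Obj\x01`, a metadata map holding `avro.schema` and `avro.codec`, then a 16-byte sync marker. It is hand-rolled with the standard library only and skips the data blocks; `write_header` and `read_header` are hypothetical names, not part of any Avro library.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the byte 1

def write_long(n: int) -> bytes:
    """Avro long: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def read_long(buf) -> int:
    """Inverse of write_long, reading from a binary stream."""
    shift = result = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)

def write_header(schema: dict, codec: str = "null") -> bytes:
    meta = {"avro.schema": json.dumps(schema).encode(),
            "avro.codec": codec.encode()}
    out = bytearray(MAGIC)
    out += write_long(len(meta))                # one map block with len(meta) entries
    for key, value in meta.items():
        k = key.encode()
        out += write_long(len(k)) + k           # string: length, then UTF-8 bytes
        out += write_long(len(value)) + value   # bytes: length, then data
    out += write_long(0)                        # a zero count ends the map
    out += os.urandom(16)                       # random sync marker
    return bytes(out)

def read_header(data: bytes):
    buf = io.BytesIO(data)
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(buf)
        if count == 0:
            break
        if count < 0:                           # negative count: a byte size follows
            read_long(buf)
            count = -count
        for _ in range(count):
            key = buf.read(read_long(buf)).decode()
            meta[key] = buf.read(read_long(buf))
    sync = buf.read(16)
    return json.loads(meta["avro.schema"]), meta["avro.codec"].decode(), sync
```

Because the schema travels in the header, a reader that has only the file can recover the writer's schema before touching any records.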

I think libraries generally take care of stuffing the schema into the wire protocol, and I have a hunch you're right in that it's implementation-defined.

I like that in this regard, any individual record in Ion is standalone. I can think of a few ways that could come in handy, e.g., a data packet of nested mixed-version records. Did not know about Parquet, thanks!

There are some use cases where record-level self-description is very useful, for example when dealing with small records in a database, NoSQL store, or message queue that could be written by multiple versions of an application. To cover that case well with Avro, where records are not self-describing, you really need something like a schema registry and a schema id embedded with each record (e.g. http://www.confluent.io/blog/schema-registry-kafka-stream-pr... ).
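
A rough sketch of the "schema id with each record" idea. The `SchemaRegistry` class is a hypothetical in-memory stand-in for a real registry service, and the framing (one version byte plus a 4-byte big-endian id in front of the payload) is an assumption made for illustration.

```python
import struct

class SchemaRegistry:
    """Hypothetical in-memory stand-in for a schema registry service."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}

    def register(self, schema_json: str) -> int:
        # Registering the same schema twice returns the same id.
        if schema_json not in self._ids:
            new_id = len(self._ids) + 1
            self._ids[schema_json] = new_id
            self._by_id[new_id] = schema_json
        return self._ids[schema_json]

    def lookup(self, schema_id: int) -> str:
        return self._by_id[schema_id]

MAGIC = 0  # assumed framing version byte

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix an encoded record with the id of the schema that wrote it."""
    return struct.pack(">bI", MAGIC, schema_id) + payload

def unframe(message: bytes):
    """Split a framed message back into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC:
        raise ValueError("unknown framing")
    return schema_id, message[5:]
```

Each record then costs five extra bytes rather than a full schema, and a consumer looks up the writer's schema by id before decoding.
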
The intent of the stored schema isn't really self-description. A typical use case for Avro is data storage over long periods of time, and the schema is expected to evolve at some point during that time. You therefore still need to specify a target schema to read the data into, and it is allowed to differ from the stored schema. Avro then uses the stored schema to map the stored data into the target schema. Most Avro libraries expect you to get the target schema from a separate source before reading data.
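
A simplified sketch of that mapping step for record fields, assuming the record has already been decoded with the stored (writer) schema. It covers only added and removed fields and ignores type promotion; the schemas and the `resolve` helper are made up for illustration and are not an Avro library API.

```python
# Schema the data was written with (stored alongside the data).
writer_schema = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "nickname", "type": "string"},
]}

# Target schema the application wants to read into today.
reader_schema = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": None},
]}

def resolve(record: dict, writer: dict, reader: dict) -> dict:
    """Project a record decoded with the writer schema onto the reader schema."""
    writer_fields = {f["name"] for f in writer["fields"]}
    out = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in writer_fields:
            out[name] = record[name]          # present in both: copy the value
        elif "default" in field:
            out[name] = field["default"]      # new in reader: use its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out                                # writer-only fields are dropped
```

Here `resolve({"id": 7, "nickname": "wyc"}, writer_schema, reader_schema)` keeps `id`, drops `nickname`, and fills `email` from the reader's default.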