

by jcrites 3715 days ago

That's interesting. I didn't know that about Avro. Does the framework take responsibility for including the schema and defining a format consisting of schema plus data, or is that the responsibility of the application layer? It sounds like that might just be a convention or best practice recommended in the documentation, rather than a technical property of Avro itself.

If it's the application's responsibility to bundle the schema in Avro, then one difference is that Ion takes responsibility for embedding schema information along with each structure and field. Ion is also capable of representing data where there is no schema (analogy: a complex document like an HTML5 page), or working efficiently with large structures without deserializing everything even if the application needs data in just one field.

Another platform in contrast with Ion is Apache Parquet [1]. Parquet's support for columnar data means that it can serialize and compress table-like data extremely efficiently (it serializes all values in one column, followed by the next, until the end of a chunk -- enabling efficient compression as well as efficient column scans). Ion by comparison would serialize each row and field within it in a self-describing way (even though that information is redundant, in this particular case, since all rows are the same). Great flexibility and high fidelity at the expense of efficiency.
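The row-versus-column contrast described above can be sketched with plain JSON standing in for both formats. This is only an illustration of the layout difference, not Ion's or Parquet's actual encoding; the sample data and the helper names (`to_row_oriented`, `to_column_oriented`, `from_column_oriented`) are made up for the example.

```python
import json
import zlib

# Hypothetical table-like data: every row has exactly the same fields.
rows = [
    {"id": i, "city": "Seattle" if i % 2 else "Boston", "temp": 20.0 + i % 3}
    for i in range(1000)
]

def to_row_oriented(records):
    """Self-describing rows: the field names are repeated in every record."""
    return "\n".join(json.dumps(r) for r in records).encode("utf-8")

def to_column_oriented(records):
    """Columnar chunk: each field name appears once, followed by all its values."""
    columns = {name: [r[name] for r in records] for name in records[0]}
    return json.dumps(columns).encode("utf-8")

def from_column_oriented(blob):
    """Rebuild the original rows from a columnar chunk."""
    columns = json.loads(blob)
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]

row_blob = to_row_oriented(rows)
col_blob = to_column_oriented(rows)

# Print raw and zlib-compressed sizes for both layouts.
for label, blob in (("row-oriented", row_blob), ("columnar", col_blob)):
    print(label, len(blob), "bytes raw,", len(zlib.compress(blob)), "bytes compressed")
```

The columnar blob stores each field name once per chunk, while the row-oriented blob repeats it per record; that repetition is the redundancy mentioned above for data where all rows share one shape.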

[1] https://parquet.apache.org/documentation/latest/

3 comments

aeroevan 3715 days ago

Avro files have a header with metadata that includes the schema as well as things like the compression codec (deflate and snappy are supported), and all of the implementations that I have used (mostly the Java and Python bindings) just handle this in the background.
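A rough sketch of what such a header looks like on disk, using only the Python standard library rather than a real Avro binding. It follows the object container file layout as I understand it from the spec (a 4-byte magic, a string-to-bytes metadata map holding `avro.schema` and `avro.codec`, then a 16-byte sync marker); it writes no data blocks, and the function names and the `Reading` schema are hypothetical.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # 'O', 'b', 'j', followed by the byte 1

def encode_long(n):
    """Avro long: zigzag encoding, then a little-endian base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def decode_long(stream):
    """Inverse of encode_long, reading from a binary stream."""
    shift = 0
    z = 0
    while True:
        byte = stream.read(1)[0]
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

def encode_bytes(data):
    """Avro bytes/string: a long length prefix followed by the raw bytes."""
    return encode_long(len(data)) + data

def write_header(schema, codec="null"):
    """Build only the file header: magic, metadata map, sync marker."""
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": codec.encode("utf-8"),
    }
    out = bytearray(MAGIC)
    out += encode_long(len(meta))  # one map block holding every entry
    for key, value in meta.items():
        out += encode_bytes(key.encode("utf-8")) + encode_bytes(value)
    out += encode_long(0)          # a zero count ends the map
    out += os.urandom(16)          # sync marker, random per file
    return bytes(out)

def read_header(blob):
    """Return (schema, codec, sync_marker) parsed from header bytes."""
    stream = io.BytesIO(blob)
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = decode_long(stream)
        if count == 0:
            break
        if count < 0:              # a negative count is followed by a byte size
            decode_long(stream)
            count = -count
        for _ in range(count):
            key = stream.read(decode_long(stream)).decode("utf-8")
            meta[key] = stream.read(decode_long(stream))
    sync = stream.read(16)
    return json.loads(meta["avro.schema"]), meta["avro.codec"].decode("utf-8"), sync

# Hypothetical schema, used only to exercise the header round trip.
example_schema = {
    "type": "record",
    "name": "Reading",
    "fields": [{"name": "value", "type": "double"}],
}
header = write_header(example_schema, codec="deflate")
parsed_schema, parsed_codec, sync_marker = read_header(header)
```

In practice the bindings do all of this for you, which is the point being made above; the sketch is just to show that the schema travels inside the file itself.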

Another fun thing is that Avro supports union types, so to make a field nullable you just declare its type as a union, e.g. ["null", "double"] or whatever.
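For anyone who hasn't seen it, a nullable field in an Avro schema looks roughly like this. Avro schemas are JSON, so the sketch builds one as a Python dict; the record name `Reading`, its field names, and the `matches_union` helper are made up for illustration and are not part of any Avro library.

```python
import json

# Hypothetical record schema with one nullable double field.
schema = {
    "type": "record",
    "name": "Reading",
    "fields": [
        {"name": "sensor", "type": "string"},
        # The union of "null" and "double" is what makes the field nullable.
        # A union field's default has to match the first branch, hence null first.
        {"name": "value", "type": ["null", "double"], "default": None},
    ],
}

def matches_union(value, branches):
    """Toy check: does a Python value fit any branch of a union of primitives?"""
    checks = {
        "null": lambda v: v is None,
        "double": lambda v: isinstance(v, float),
        "string": lambda v: isinstance(v, str),
    }
    return any(checks[branch](value) for branch in branches)

value_type = schema["fields"][1]["type"]
print(json.dumps(schema, indent=2))
print(matches_union(None, value_type), matches_union(21.5, value_type))
```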

But one of the best things about Avro (and Parquet, for that matter) is that it is well supported by the Hadoop ecosystem.


andrioni 3715 days ago

In the spec [1] there is a definition of an "object container file", which includes the schema and is the default format used whenever you save an Avro file. You can even use it when sending Avro data over the wire, if you don't mind paying the extra space cost.

[1]: http://avro.apache.org/docs/1.7.7/spec.html


wyc 3715 days ago

I think libraries generally take care of stuffing the schema into the wire protocol, and I have a hunch you're right in that it's implementation-defined.

I like that, in this regard, any individual record in Ion is standalone. I can think of a few ways that could come in handy, e.g., a data packet of nested mixed-version records. Did not know about Parquet, thanks!
