|
|
|
|
|
by jnewhouse
3355 days ago
|
|
If you want more details, we were packing a Row class into a base64 encoded string using an ObjectOutputStream. This is a fine thing for small scale serialization but sucks at scale, because of the reasons mentioned in the post.
Sorry we don't have code examples, but it's unclear how useful it'd be given that no one else uses our file format. If you want a bit more detail on how the format works. Each metadata contains a list of typed columns to define the schema of a given part. Our map-reduce framework has a bunch of internal logic that tries to justify the written Row class with the one the Mapper class is asking for. This allows us to do things like ingest different versions of a row with in the context of a single job.
I think questions of serialization at the scale are generally interesting, although ymmv. I know of one company using Avro, which doesn't let you cleanly update or track schema. They've ended up storing every schema in an HBase table and reserving the first 8 bytes to do a lookup into this table to know the row's schema. |
|
I agree serialization at scale is interesting. My particular interest right at this moment is in efficiently doing incremental updates of HDFS files (Parquet & Avro) from observing changes in MySQL tables - not completely trivial because some ETL with joins and unions is required to get data in the right shape.