by vinkelhake 2995 days ago
If you were interested in RecordIO, then this project might also be of interest to you: https://github.com/google/riegeli
2 comments

Pretty neat work. Especially the corruption detection/skipping and seek support. That part was super ugly in RecordIO proper, and relied on stars properly aligning and the absence of cosmic radiation. RecordIO being the default format for everything at Google, it's not something they can really fix though. Taking care of concatenation is a nice touch as well.

As someone who has spent quite a bit of time at Google working on a high performance file format (not RecordIO):

1. I'd also add LZ4 and/or Snappy for the cases where they are the Pareto-optimal choice (i.e. fast, network-attached remote storage such as SSD Colossus or its external proxy, SSD Persistent Disk).

2. IMO HighwayHash is overkill here, and the author should have used CRC32C instead. You don't particularly care about collisions in this case; you're detecting data corruption. CRC32C is perfect for that: it's hardware-accelerated on almost all recent Intel and ARM CPUs, and at 32 bits it's half the size on disk.

3. It'd be pretty cool to introduce some kind of metadata which would tell the user what type of message is encoded in the file. This is not something RecordIO has, but internal tools can guess most of the time because they have all the proto definitions at their disposal. There's no need to store it in every header, just the first one. I would advise against storing the full schema (that can get very gnarly in the presence of proto dependencies and extensions), but just have something lightweight, i.e. the message name and perhaps an SCM revision number or hash in the file header, so that the user (or the external system consuming the files) can somewhat reliably establish what the format is later on, after the proto definition has drifted. Otherwise, this being a binary serialized file format, it's very easy to end up in a situation where you have some files from years ago and you no longer know how to read them. And yes, I'm aware that an SCM hash can change if history is edited.
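
To make points 2 and 3 concrete, here is a rough Python sketch of a toy container: one file header carrying the message name and an SCM revision, then records that each carry a CRC32C. None of this is Riegeli's or RecordIO's actual layout; the `DEMO` magic, the field order, and the names `write_file` / `read_file` are made up for illustration, and the pure-Python CRC32C stands in for the hardware-accelerated version you would use in practice.

```python
import struct

# CRC32C (Castagnoli), reflected polynomial 0x82F63B78, table-driven.
def _make_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _make_table()

def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

MAGIC = b"DEMO"  # hypothetical magic bytes, not a real format

def write_file(message_name: str, revision: str, records) -> bytes:
    """Header (message name + SCM revision) once, then CRC-protected records."""
    name, rev = message_name.encode(), revision.encode()
    header = struct.pack("<HH", len(name), len(rev)) + name + rev
    out = bytearray(MAGIC + header + struct.pack("<I", crc32c(header)))
    for rec in records:
        # Each record: 4-byte length, 4-byte CRC32C of the payload, payload.
        out += struct.pack("<II", len(rec), crc32c(rec)) + rec
    return bytes(out)

def read_file(blob: bytes):
    """Return (message_name, revision, records); raise ValueError on corruption."""
    if blob[:4] != MAGIC:
        raise ValueError("bad magic")
    name_len, rev_len = struct.unpack_from("<HH", blob, 4)
    header_end = 8 + name_len + rev_len
    (header_crc,) = struct.unpack_from("<I", blob, header_end)
    if crc32c(blob[4:header_end]) != header_crc:
        raise ValueError("corrupt header")
    name = blob[8:8 + name_len].decode()
    rev = blob[8 + name_len:header_end].decode()
    pos, records = header_end + 4, []
    while pos < len(blob):
        length, crc = struct.unpack_from("<II", blob, pos)
        pos += 8
        payload = blob[pos:pos + length]
        if crc32c(payload) != crc:
            raise ValueError(f"corrupt record at offset {pos}")
        records.append(payload)
        pos += length
    return name, rev, records
```

A reader that finds an old file can at least print the message name and revision from the header before trying to parse anything, and the per-record checksum costs 4 bytes instead of 8.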

Interesting. I wonder how different that is from RecordIO. Also, whether there'll be a Go implementation.

[Edit, after looking a bit.]

Pretty different. If I remember correctly, RecordIO is re-synchronizing, whereas Riegeli seems to break things up into 64KB chunks, splitting messages across chunks if necessary.
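
To illustrate what splitting messages across fixed-size chunks can look like, here is a toy Python sketch. It is an assumption-laden illustration, not Riegeli's real on-disk format: the 4-byte block header, the `NO_RECORD` sentinel, and the function names are all invented here. Each block stores the offset of the first record that starts inside it, so a reader can jump to any block and pick up at the next record boundary.

```python
import struct

BLOCK_SIZE = 64 * 1024
HEADER = struct.Struct("<I")          # offset of first record starting in block
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size
NO_RECORD = 0xFFFFFFFF                # sentinel: no record starts in this block

def write_blocks(records):
    """Frame records as length + data, then cut the stream into blocks."""
    stream, starts = bytearray(), []
    for rec in records:
        starts.append(len(stream))
        stream += struct.pack("<I", len(rec)) + rec
    blocks = []
    for base in range(0, len(stream), PAYLOAD_SIZE):
        first = next(
            (s - base for s in starts if base <= s < base + PAYLOAD_SIZE),
            NO_RECORD,
        )
        blocks.append(HEADER.pack(first) + bytes(stream[base:base + PAYLOAD_SIZE]))
    return blocks

def read_from_block(blocks, index):
    """Return every record that starts in block `index` or later."""
    # Skip blocks that only hold the middle or tail of a long record.
    while index < len(blocks):
        (first,) = HEADER.unpack_from(blocks[index])
        if first != NO_RECORD:
            break
        index += 1
    else:
        return []
    data = b"".join(block[HEADER.size:] for block in blocks[index:])
    pos, out = first, []
    while pos + 4 <= len(data):
        (length,) = struct.unpack_from("<I", data, pos)
        pos += 4
        out.append(data[pos:pos + length])
        pos += length
    return out
```

With a 100,000-byte record followed by smaller ones, the first record spills into the second block, and reading from the second block resumes at the record after it.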

[Edit, after finding more information.]

Interesting… looks like Riegeli is intended to compress well, rather than just store sequentially. https://encode.ru/threads/2895-Riegeli-%E2%80%94-a-new-compr...

IIRC (but memory has faded considerably), RecordIO also supported something to aid compression across records (rather than just offering per-record compression). There was some gnarly code in it to that effect, where there could be a compressed subset of several records within the file. But I might be wrong.
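
For anyone wondering why compressing across records matters, a quick Python sketch using zlib as a stand-in codec (this is not how RecordIO or Riegeli actually do it, and the function names are made up). Small, similar records compressed one by one cannot share any redundancy, while a group compressed together can.

```python
import struct
import zlib

def compress_per_record(records):
    """Compress every record on its own; no redundancy is shared."""
    return [zlib.compress(rec) for rec in records]

def compress_group(records):
    """Length-prefix the records, then compress the whole group at once."""
    framed = b"".join(struct.pack("<I", len(rec)) + rec for rec in records)
    return zlib.compress(framed)

def decompress_group(blob):
    """Inverse of compress_group: recover the original list of records."""
    framed = zlib.decompress(blob)
    pos, out = 0, []
    while pos < len(framed):
        (length,) = struct.unpack_from("<I", framed, pos)
        pos += 4
        out.append(framed[pos:pos + length])
        pos += length
    return out
```

The trade-off is that you have to decompress the whole group to read one record, which is presumably why formats do this for a bounded subset of records rather than the entire file.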