by orphea 387 days ago
Some formats are meant to be streamable. And if the stream is not seekable, then you have to read all 12 GB before you get to the index.

The point is, not everything is black and white. Where to put the index is just another trade-off.

5 comments

Differing trade-offs are why it might make sense to embrace the Unix way for file formats: do one thing well, and document it so that others can do a different thing well with the same data and no loss.

For example, if it is an archival/recording-oriented use case, then you make it cheap and easy to add data, and possibly add some resiliency for when the recording process crashes. If you want efficient random access, streaming, or storage efficiency, the same dataset can be stored in a different layout without loss of quality. Conversion between layouts doesn’t have to be extremely optimal; it just has to be possible to implement from the spec.

Like, say, you record raw video. You want “all of the quality” and you know all in all it’s going to take terabytes, so bringing excess capacity is basically a given when shooting. Therefore, if some camera maker, in its infinite wisdom, creates a proprietary undocumented format to sliiightly improve on file size but “accidentally” makes it unusable in most software without first converting it using their own proprietary tool, you may justifiably not appreciate it. (Canon Cinema Raw Light HQ, I’m looking at you. I kid you not, that’s what it’s called.)

On this note, what are the best/accepted approaches out there when it comes to documenting/speccing out file formats? Ideally something generalized enough that it can also handle cases where the “file” is in fact a particularly structured directory (a la macOS app bundle).

Adding to the point about recording _raw_ video: for such purposes, try to design the format so that losing a portion of the file doesn't render it entirely unusable. Kinda like how you can recover DV video from spliced tapes, because the data for the current frame (+/- the bordering frame) is enough to start a valid new file stream.

And most of them aren't. And even for those that are, it's much easier to implement the ability to retrieve the last chunk of a file than to deal with the significant performance degradation of forced file rewrites.

Think about a format you've used that has all those properties: PDF. PDFs of several hundred MB aren't rare. Now imagine how it works in your world:

* Add a note? Wait for the file to be completely rewritten and burn 100s of MB of your data to sync to iCloud/Drive.

* Fill a form? Same.

* Add an annotation with your Apple Pencil? Yup, same.

Now look at how it works right now:

* Add text? Fill a form? Add a drawing? A few KB of data are appended and uploaded.

* Sign the document to confirm authenticity? You got it, a few KB of data at the end.

* Determine which data was added after the document was signed and sign it with another cert? A few bytes.

Do you need to stream the PDF? Load the last chunk to detect the dictionary. If you don't want to do that, configure the PDF writer to output the dictionary at the start and you still end up with a better solution.
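
For what it's worth, the "load the last chunk" step is cheap. Here is a minimal sketch in Python, assuming a well-formed PDF whose final bytes follow the usual `startxref <offset> %%EOF` layout (what is called the dictionary above is, in PDF terms, the cross-reference table plus trailer). The function name `find_startxref` and the 1024-byte tail size are my own choices for illustration, not anything a particular PDF library exposes.

```python
import io
import re


def find_startxref(f, tail_size=1024):
    """Return the byte offset of the newest cross-reference section.

    Only the last `tail_size` bytes are read, so the cost does not
    depend on how large the file is.
    """
    size = f.seek(0, io.SEEK_END)
    f.seek(max(0, size - tail_size))
    tail = f.read()
    # Each incremental update appends its own startxref/%%EOF pair,
    # so the last match in the tail is the one that is current.
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found in the file tail")
    return int(matches[-1])
```

A reader would then seek to that offset, parse the cross-reference data there, and follow the trailer's `/Prev` entry for older revisions. That part is left out here.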

That’s true, but streamable formats often don’t need an index.

A team member just created a new tool that uses the tar format (streamable), but then puts the index as the penultimate entry, with the last entry being just a fixed-size entry holding the offset of the beginning of the index.

In this way normal tar tools just work but it’s possible to retrieve a listing and access a file randomly. It’s also still possible to append to it in the future, modulo futzing with the index a bit.

(The intended purpose is archiving files that were stored as S3 objects back into S3.)
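
A rough sketch of that layout in Python, using only the standard `tarfile` module. This is not the tool itself: the entry names `.index.json` and `.index-offset`, the JSON index, and the 40-byte footer (which stores the index length alongside its offset) are all assumptions made for illustration.

```python
import io
import json
import tarfile

BLOCK = 512                      # tar block size
INDEX_NAME = ".index.json"       # hypothetical name for the index entry
FOOTER_NAME = ".index-offset"    # hypothetical name for the fixed-size entry


def _add(tar, raw, name, data):
    """Append one member and return [data_offset, size] in the archive."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
    padded = -(-len(data) // BLOCK) * BLOCK   # payload rounded up to a block
    return [raw.tell() - padded, len(data)]


def write_archive(raw, files):
    """Write files, then the index, then a fixed-size footer entry."""
    with tarfile.open(fileobj=raw, mode="w") as tar:
        index = {name: _add(tar, raw, name, data) for name, data in files.items()}
        offset, size = _add(tar, raw, INDEX_NAME, json.dumps(index).encode())
        # Fixed width: two 20-digit decimal fields (offset, then length).
        _add(tar, raw, FOOTER_NAME, f"{offset:020d}{size:020d}".encode())


def read_index(raw):
    """Locate the index from the end of the archive without scanning it."""
    end = raw.seek(0, io.SEEK_END)
    while end >= BLOCK:
        raw.seek(end - BLOCK)
        block = raw.read(BLOCK)
        if block.strip(b"\0"):    # first non-zero block from the end
            break
        end -= BLOCK              # skip zero-filled end-of-archive padding
    else:
        raise ValueError("no footer entry found")
    offset, size = int(block[:20]), int(block[20:40])
    raw.seek(offset)
    return json.loads(raw.read(size))


def read_member(raw, index, name):
    """Random access: seek straight to one member's payload."""
    offset, size = index[name]
    raw.seek(offset)
    return raw.read(size)
```

Ordinary tar tools see the two extra entries as regular files, so listing and extraction keep working. Appending later would mean writing new members over the old end-of-archive blocks and then emitting a fresh index and footer, which is the "futzing" mentioned above.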

Yes, a good point. Each file format must, of course, try to optimise for the use cases it supports.

Make the index a linked data structure. You can then extend it whenever, wherever.
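
One way to read that, as a minimal Python sketch: every append writes its payloads, then an index chunk that points back at the previous chunk, then a small fixed-size trailer pointing at the newest chunk. The container layout, the JSON encoding, and the names `append_batch` and `load_index` are all made up for illustration; no existing format is being described.

```python
import io
import json
import struct

# Fixed-size trailer: offset and length of the newest index chunk.
TRAILER = struct.Struct("<QQ")


def append_batch(f, files):
    """Append payloads plus a new index chunk linked to the previous one."""
    size = f.seek(0, io.SEEK_END)
    prev = None
    if size >= TRAILER.size:
        # The old trailer tells us where the previous index chunk lives.
        f.seek(size - TRAILER.size)
        prev = list(TRAILER.unpack(f.read(TRAILER.size)))
        f.seek(0, io.SEEK_END)
    entries = {}
    for name, data in files.items():
        entries[name] = [f.tell(), len(data)]
        f.write(data)
    chunk = json.dumps({"prev": prev, "entries": entries}).encode()
    offset = f.tell()
    f.write(chunk)
    f.write(TRAILER.pack(offset, len(chunk)))


def load_index(f):
    """Walk the chain from newest to oldest; newer entries win."""
    size = f.seek(0, io.SEEK_END)
    f.seek(size - TRAILER.size)
    link = TRAILER.unpack(f.read(TRAILER.size))
    merged = {}
    while link:
        offset, length = link
        f.seek(offset)
        chunk = json.loads(f.read(length))
        for name, location in chunk["entries"].items():
            merged.setdefault(name, location)
        link = chunk["prev"]
    return merged
```

Nothing already written is ever rewritten, so an update costs only the new data plus one small index chunk. The price is that a reader may have to follow several links to build the full listing.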