Hacker News new | ask | show | jobs
by thasso 387 days ago
For archive formats, or anything that has a table of contents or an index, consider putting the index at the end of the file so that you can append to it without moving a lot of data around. This also allows for easy concatenation.
2 comments

What probably allows for even more easier concatenation would be to store the header of each file immediately preceding the data of that file. You can make a index in memory when reading the file if that is helpful for your use.
This would require a separate seek and read operation per archive member, each yielding only one directory entry, rather than very few read operation to load the whole directory at once.
Why not put it at the beginning so that it is available at the start of the filestream that way it is easier to get first so you know what other ranges of the file you may need?

>This also allows for easy concatenation.

How would it be easier than putting it at the front?

Files are... Flat streams. Sort of.

So if you rewrite an index at the head of the file, you may end up having to rewrite everything that comes afterwards, to push it further down in the file, if it overflows any padding offset. Which makes appending an extremely slow operation.

Whereas seeking to end, and then rewinding, is not nearly as costly.

You can do it via fallocate(2) FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE but sadly these still have a lot of limitations and are not portable. Based on discussions I've read, it seems there is no real motivation for implementing support for it, since anyone who cares about the performance of doing this will use some DB format anyway.

In theory, files should be just unrolled linked lists (or trees) of bytes, but I guess a lot of internal code still assumes full, aligned blocks.

Most workflows do not modify files in place but rather create new files as its safer and allows you to go back to the original if you made a mistake.
If you're writing twice, you don't care about the performance to begin with. Or the size of the files being produced.

But if you're writing indices, there's a good chance that you do care about performance.

Files are often authored once and read / used many times. When authoring a file performance is less important and there is plenty of file space available. Indices are for the performance for using the file which is more important than the performance for authoring it.
If storage and concern aren't a concern when writing, then you probably shouldn't be doing workarounds to include the index in the file itself. Follow the dbm approach and separate both into two different files.

Which is what dbm, bdb, Windows search indexes, IBM datasets, and so many, many other standards will do.

If the archive is being updated in place, turning ABC# into ABCD#' (where # and #' are indices) is easier than turning #ABC into #'ABCD. The actual position of indices doesn't matter much if the stream is seekable. I don't think the concatenation is a good argument though.
Imagine you have a 12Gb zip file, and you want to add one more file to it. Very easy and quick if the index is at the end, very slow if it's at the start (assuming your index now needs more space than is available currently).

Reading the index from the end of the file is also quick; where you read next depends on what you are trying to find in it, which may not be the start.

Some formats are meant to be streamable. And if the stream is not seekable, then you have to read all 12 Gb before you get to the index.

The point is, not all is black and white. Where to put the index is just another trade off.

Different trade-offs is why it might make sense to embrace the Unix way for file formats: do one thing well, and document it so that others can do a different thing well with the same data and no loss.

For example, if it is an archival/recording oriented use case, then you make it cheap/easy to add data and possibly add some resiliency for when recording process crashes. If you want efficient random access, streaming, storage efficiency, the same dataset can be stored in a different layout without loss of quality—and conversion between them doesn’t have to be extremely optimal, it just should be possible to implement from spec.

Like, say, you record raw video. You want “all of the quality” and you know all in all it’s going to take terabytes, so bringing excess capacity is basically a given when shooting. Therefore, if some camera maker, in its infinite wisdom, creates a proprietary undocumented format to sliiightly improve on file size but “accidentally” makes it unusable in most software without first converting it using their own proprietary tool, you may justifiedly not appreciate it. (Canon Cinema Raw Light HQ—I kid you not, that’s what it’s called—I’m looking at you.)

On this note, what are the best/accepted approaches out there when it comes to documenting/speccing out file formats? Ideally something generalized enough that it can also handle cases where the “file” is in fact a particularly structured directory (a la macOS app bundle).

Adding to the recording _raw_ video point, for such purposes, try to design the format so that losing a portion of the file doesn't render it entirely unusable. Kinda like how you can recover DV video from spliced tapes because the data for the current frame (+/- the bordering frame) is enough to start a valid new file stream.
And most of them aren't. And even those that are - it's much easier to implement the ability to retrieve the last chunk of file than to deal with significant performance degradation of forced file rewrites.

Think about a format that has all those properties and you've used - PDF. PDFs the size of several 100s of MB aren't rare. Now imagine how it works in your world:

* Add a note? Wait for the file to be completely rewritten and burn 100s of MB of your data to sync to iCloud/Drive.

* Fill a form? Same.

* Add an annotation with your Apple Pencil? Yup, same.

Now look at how it works right now:

- Add a text? Fill a form? Add a drawing? A few KB of data is appended and uploaded.

* Sign the document to confirm authenticy? You got it, a few KB of data at the end.

* Determine which data was added after the document was signed and sign it with another cert? A few bytes.

Do you need to stream the PDF? Load the last chunk to detect the dictionary. If you don't want to do that, configure PDF writer to output the dictionary at the start and you still end up with a better solution.

That’s true, but streamable formats often don’t need an index.

A team member just created a new tool that uses the tar format (streamable), but then puts the index as the penultimate entry, with the last entry just being a fixed size entry with the offset of the beginning of the index.

In this way normal tar tools just work but it’s possible to retrieve a listing and access a file randomly. It’s also still possible to append to it in the future, modulo futzing with the index a bit.

(The intended purpose is archiving files that were stored as S3 objects back into S3.?

Yes, a good point. Each file format must try to optimise for the use cases it supports of course.
make the index a linked data structure. You can then extend it whenever, wherever
> How would it be easier than putting it at the front?

Have you ever wondered why `tar` is the Tape Archive? Tape. Magnetic recording tape. You stream data to it, and rewinding is Hard, so you put the list of files you just dealt with at the very end. This now-obsolete hardware expectation touches us decades later.

tar streams don't have an index at all, actually, they're just a series of header blocks and data blocks. Some backup software built on top may include a catalog of some kind inside the tar stream itself, of course, and may choose to do so as the last entry.
IIRC, the original TAR format was just writing the 'struct stat' from sys/stat.h, followed by the file contents for each file.
But new file formats being developed are most likely not going to be designed to be used with tapes. If you want to avoid rewinds you can write a new concatenated version of the files. This also allows you to keep the original in case you need it.