| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by thasso 387 days ago
	For archive formats, or anything that has a table of contents or an index, consider putting the index at the end of the file so that you can append to it without moving a lot of data around. This also allows for easy concatenation.

2 comments

zzo38computer 386 days ago

What probably allows for even more easier concatenation would be to store the header of each file immediately preceding the data of that file. You can make a index in memory when reading the file if that is helpful for your use.

link

HelloNurse 386 days ago

This would require a separate seek and read operation per archive member, each yielding only one directory entry, rather than very few read operation to load the whole directory at once.

link

charcircuit 387 days ago

Why not put it at the beginning so that it is available at the start of the filestream that way it is easier to get first so you know what other ranges of the file you may need?

>This also allows for easy concatenation.

How would it be easier than putting it at the front?

link

shakna 387 days ago

Files are... Flat streams. Sort of.

So if you rewrite an index at the head of the file, you may end up having to rewrite everything that comes afterwards, to push it further down in the file, if it overflows any padding offset. Which makes appending an extremely slow operation.

Whereas seeking to end, and then rewinding, is not nearly as costly.

link

PhilipRoman 387 days ago

You can do it via fallocate(2) FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE but sadly these still have a lot of limitations and are not portable. Based on discussions I've read, it seems there is no real motivation for implementing support for it, since anyone who cares about the performance of doing this will use some DB format anyway.

In theory, files should be just unrolled linked lists (or trees) of bytes, but I guess a lot of internal code still assumes full, aligned blocks.

link

charcircuit 387 days ago

Most workflows do not modify files in place but rather create new files as its safer and allows you to go back to the original if you made a mistake.

link

shakna 387 days ago

If you're writing twice, you don't care about the performance to begin with. Or the size of the files being produced.

But if you're writing indices, there's a good chance that you do care about performance.

link

charcircuit 387 days ago

Files are often authored once and read / used many times. When authoring a file performance is less important and there is plenty of file space available. Indices are for the performance for using the file which is more important than the performance for authoring it.

link

shakna 387 days ago

If storage and concern aren't a concern when writing, then you probably shouldn't be doing workarounds to include the index in the file itself. Follow the dbm approach and separate both into two different files.

Which is what dbm, bdb, Windows search indexes, IBM datasets, and so many, many other standards will do.

link

lifthrasiir 387 days ago

If the archive is being updated in place, turning ABC# into ABCD#' (where # and #' are indices) is easier than turning #ABC into #'ABCD. The actual position of indices doesn't matter much if the stream is seekable. I don't think the concatenation is a good argument though.

link

MattPalmer1086 387 days ago

Imagine you have a 12Gb zip file, and you want to add one more file to it. Very easy and quick if the index is at the end, very slow if it's at the start (assuming your index now needs more space than is available currently).

Reading the index from the end of the file is also quick; where you read next depends on what you are trying to find in it, which may not be the start.

link

orphea 387 days ago

Some formats are meant to be streamable. And if the stream is not seekable, then you have to read all 12 Gb before you get to the index.

The point is, not all is black and white. Where to put the index is just another trade off.

link

strogonoff 387 days ago

Different trade-offs is why it might make sense to embrace the Unix way for file formats: do one thing well, and document it so that others can do a different thing well with the same data and no loss.

For example, if it is an archival/recording oriented use case, then you make it cheap/easy to add data and possibly add some resiliency for when recording process crashes. If you want efficient random access, streaming, storage efficiency, the same dataset can be stored in a different layout without loss of quality—and conversion between them doesn’t have to be extremely optimal, it just should be possible to implement from spec.

Like, say, you record raw video. You want “all of the quality” and you know all in all it’s going to take terabytes, so bringing excess capacity is basically a given when shooting. Therefore, if some camera maker, in its infinite wisdom, creates a proprietary undocumented format to sliiightly improve on file size but “accidentally” makes it unusable in most software without first converting it using their own proprietary tool, you may justifiedly not appreciate it. (Canon Cinema Raw Light HQ—I kid you not, that’s what it’s called—I’m looking at you.)

On this note, what are the best/accepted approaches out there when it comes to documenting/speccing out file formats? Ideally something generalized enough that it can also handle cases where the “file” is in fact a particularly structured directory (a la macOS app bundle).

link

trinix912 387 days ago

Adding to the recording _raw_ video point, for such purposes, try to design the format so that losing a portion of the file doesn't render it entirely unusable. Kinda like how you can recover DV video from spliced tapes because the data for the current frame (+/- the bordering frame) is enough to start a valid new file stream.

link

izacus 386 days ago

And most of them aren't. And even those that are - it's much easier to implement the ability to retrieve the last chunk of file than to deal with significant performance degradation of forced file rewrites.

Think about a format that has all those properties and you've used - PDF. PDFs the size of several 100s of MB aren't rare. Now imagine how it works in your world:

* Add a note? Wait for the file to be completely rewritten and burn 100s of MB of your data to sync to iCloud/Drive.

* Fill a form? Same.

* Add an annotation with your Apple Pencil? Yup, same.

Now look at how it works right now:

- Add a text? Fill a form? Add a drawing? A few KB of data is appended and uploaded.

* Sign the document to confirm authenticy? You got it, a few KB of data at the end.

* Determine which data was added after the document was signed and sign it with another cert? A few bytes.

Do you need to stream the PDF? Load the last chunk to detect the dictionary. If you don't want to do that, configure PDF writer to output the dictionary at the start and you still end up with a better solution.

link

jonstewart 387 days ago

That’s true, but streamable formats often don’t need an index.

A team member just created a new tool that uses the tar format (streamable), but then puts the index as the penultimate entry, with the last entry just being a fixed size entry with the offset of the beginning of the index.

In this way normal tar tools just work but it’s possible to retrieve a listing and access a file randomly. It’s also still possible to append to it in the future, modulo futzing with the index a bit.

(The intended purpose is archiving files that were stored as S3 objects back into S3.?

link

MattPalmer1086 387 days ago

Yes, a good point. Each file format must try to optimise for the use cases it supports of course.

link

vrighter 387 days ago

make the index a linked data structure. You can then extend it whenever, wherever

link

McGlockenshire 387 days ago

> How would it be easier than putting it at the front?

Have you ever wondered why `tar` is the Tape Archive? Tape. Magnetic recording tape. You stream data to it, and rewinding is Hard, so you put the list of files you just dealt with at the very end. This now-obsolete hardware expectation touches us decades later.

link

jclulow 387 days ago

tar streams don't have an index at all, actually, they're just a series of header blocks and data blocks. Some backup software built on top may include a catalog of some kind inside the tar stream itself, of course, and may choose to do so as the last entry.

link

ahoka 387 days ago

IIRC, the original TAR format was just writing the 'struct stat' from sys/stat.h, followed by the file contents for each file.

link

charcircuit 387 days ago

But new file formats being developed are most likely not going to be designed to be used with tapes. If you want to avoid rewinds you can write a new concatenated version of the files. This also allows you to keep the original in case you need it.

link