Hacker News new | ask | show | jobs
by shakna 387 days ago
Files are... Flat streams. Sort of.

So if you rewrite an index at the head of the file, you may end up having to rewrite everything that comes afterwards, to push it further down in the file, if it overflows any padding offset. Which makes appending an extremely slow operation.

Whereas seeking to end, and then rewinding, is not nearly as costly.

2 comments

You can do it via fallocate(2) FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE but sadly these still have a lot of limitations and are not portable. Based on discussions I've read, it seems there is no real motivation for implementing support for it, since anyone who cares about the performance of doing this will use some DB format anyway.

In theory, files should be just unrolled linked lists (or trees) of bytes, but I guess a lot of internal code still assumes full, aligned blocks.

Most workflows do not modify files in place but rather create new files as its safer and allows you to go back to the original if you made a mistake.
If you're writing twice, you don't care about the performance to begin with. Or the size of the files being produced.

But if you're writing indices, there's a good chance that you do care about performance.

Files are often authored once and read / used many times. When authoring a file performance is less important and there is plenty of file space available. Indices are for the performance for using the file which is more important than the performance for authoring it.
If storage and concern aren't a concern when writing, then you probably shouldn't be doing workarounds to include the index in the file itself. Follow the dbm approach and separate both into two different files.

Which is what dbm, bdb, Windows search indexes, IBM datasets, and so many, many other standards will do.

Separate files isn't always the answer. It can be more awkward to need to download both and always keep them together compared to when it's a single file.