| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by shakna 387 days ago

Files are... Flat streams. Sort of.

So if you rewrite an index at the head of the file, you may end up having to rewrite everything that comes afterwards, to push it further down in the file, if it overflows any padding offset. Which makes appending an extremely slow operation.

Whereas seeking to end, and then rewinding, is not nearly as costly.

2 comments

PhilipRoman 387 days ago

You can do it via fallocate(2) FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE but sadly these still have a lot of limitations and are not portable. Based on discussions I've read, it seems there is no real motivation for implementing support for it, since anyone who cares about the performance of doing this will use some DB format anyway.

In theory, files should be just unrolled linked lists (or trees) of bytes, but I guess a lot of internal code still assumes full, aligned blocks.

link

charcircuit 387 days ago

Most workflows do not modify files in place but rather create new files as its safer and allows you to go back to the original if you made a mistake.

link

shakna 387 days ago

If you're writing twice, you don't care about the performance to begin with. Or the size of the files being produced.

But if you're writing indices, there's a good chance that you do care about performance.

link

charcircuit 387 days ago

Files are often authored once and read / used many times. When authoring a file performance is less important and there is plenty of file space available. Indices are for the performance for using the file which is more important than the performance for authoring it.

link

shakna 387 days ago

If storage and concern aren't a concern when writing, then you probably shouldn't be doing workarounds to include the index in the file itself. Follow the dbm approach and separate both into two different files.

Which is what dbm, bdb, Windows search indexes, IBM datasets, and so many, many other standards will do.

link

charcircuit 386 days ago

Separate files isn't always the answer. It can be more awkward to need to download both and always keep them together compared to when it's a single file.

link