
by stereosteve 1879 days ago

From a quick read of the SnowFS source code, it looks like it splits large files into 100 MB blocks and builds up a zip of blocks over time. A version of a file is an ordered list of hashes for the blocks in that version.

I like the simplicity of this! But is it at all problematic if something changes early in the file and all the subsequent block boundaries shift, causing many new blocks to be created?
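
A minimal sketch of the scheme as read here: fixed-size blocks, with a version represented as an ordered list of block hashes. The function name `block_hashes`, the choice of SHA-256, and the tiny 8-byte block size are assumptions for illustration, not taken from the SnowFS source.

```python
import hashlib


def block_hashes(data: bytes, block_size: int) -> list[str]:
    """Return the ordered SHA-256 hex digests of consecutive fixed-size blocks."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


# Two versions of a 24-byte "file" that differ only in the middle block.
v1 = b"A" * 8 + b"B" * 8 + b"C" * 8
v2 = b"A" * 8 + b"X" * 8 + b"C" * 8

h1 = block_hashes(v1, 8)
h2 = block_hashes(v2, 8)

# Blocks whose hashes appear in both versions would only need to be stored once.
shared = set(h1) & set(h2)
print(len(shared), "of", len(h1), "blocks shared")
```

With an in-place change like this one, only the modified block gets a new hash, so the first and last blocks are shared between the two versions.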

rsync uses a sliding window to handle this situation. The implementation would be more complicated, but have you considered using librsync internally?

3 comments

sebastian_io 1879 days ago

I am currently working on the compression, as it is not complete yet. The 100 MB is indeed excessive but the window is dynamic and can differ from file to file since it is written to a `*.hblock` file which is stored next to the object in the object database https://github.com/Snowtrack/SnowFS/blob/03e5f839326e666c891...

Let me explain where the 100 MB window comes from, as it's not only related to the upcoming compression implementation. Some graphics applications touch the timestamps of their files for no reason, making it harder to detect whether a file changed. But some file formats always change their 'header' or 'footer'. This means that comparing the hash of the first or last 100 MB of an 8 GB file is a much faster way to detect whether the file was modified than hashing the whole file.
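
A rough sketch of that kind of check, assuming a file is fingerprinted by its size plus the hash of its first and last `window` bytes. The function name `edge_fingerprint` and the use of SHA-256 are hypothetical and not taken from SnowFS.

```python
import hashlib
import os


def edge_fingerprint(path: str, window: int) -> str:
    """Hash the file size plus the first and last `window` bytes of the file."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    h.update(size.to_bytes(8, "big"))
    with open(path, "rb") as f:
        # Head of the file (the whole file if it is smaller than the window).
        h.update(f.read(window))
        if size > window:
            # Tail of the file; it may overlap the head for small files.
            f.seek(max(0, size - window))
            h.update(f.read(window))
    return h.hexdigest()
```

This only reads up to `2 * window` bytes regardless of file size, which is where the speedup comes from. It is a heuristic: a change in the middle of the file that leaves the size and both edges untouched would not be detected, so it only fits formats that reliably rewrite their header or footer.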


digikata 1879 days ago

There's a large set of different algorithms with a sliding window. Another interesting one is the Rabin fingerprint. This kind of chunking is often used in storage file systems w/ deduplication and snapshot features.

https://en.wikipedia.org/wiki/Rabin_fingerprint
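
To illustrate the idea, here is a small sketch of content-defined chunking. It uses a plain polynomial rolling hash over a sliding window rather than a true Rabin fingerprint, and the constants (`WINDOW`, `MASK`, `BASE`, `MOD`) and the function name `cdc_chunks` are arbitrary choices made for this example.

```python
import hashlib

WINDOW = 16              # bytes covered by the sliding window
MASK = (1 << 6) - 1      # cut when the low 6 bits are zero (~64-byte average)
BASE = 257
MOD = (1 << 61) - 1
POW = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of the last WINDOW bytes."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * POW) % MOD
        h = (h * BASE + byte) % MOD
        # Only test full windows, so a cut depends on content, not on offset.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


# Deterministic pseudo-random sample data (4096 bytes).
data = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(128))

old = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)}
new = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(b"\x00" + data)}

# Chunks of the original that are missing after prepending one byte.
lost = old - new
print(len(lost), "of", len(old), "chunks lost")
```

Because each cut point depends only on the bytes inside the window, prepending a byte disturbs at most the first chunk, and every later boundary lands on the same content as before. Real implementations usually also enforce minimum and maximum chunk sizes, which this sketch leaves out.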


high_byte 1879 days ago

Cool. Although I think it would be more efficient with a 4 MB window; 100 MB seems excessive. Then I assume you wouldn't need a sliding window (if it works well enough for 100 MB).


stereosteve 1879 days ago

The problem happens with any fixed window spacing, regardless of the block size.

If you create a block every X MB, inserting a single byte at the beginning of the file will change every subsequent block.
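
A quick way to see this, as a sketch with made-up sizes (64-byte blocks over 1024 bytes of pseudo-random data; the helper name `fixed_blocks` is hypothetical):

```python
import hashlib


def fixed_blocks(data: bytes, block_size: int) -> list[str]:
    """Hash consecutive fixed-size blocks, cutting every block_size bytes."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


# Deterministic pseudo-random data, so every 64-byte block is distinct.
data = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(32))

original = fixed_blocks(data, 64)
prepended = fixed_blocks(b"\x00" + data, 64)   # one byte inserted at the start
appended = fixed_blocks(data + b"\x00", 64)    # one byte added at the end

reused_after_prepend = len(set(original) & set(prepended))
reused_after_append = len(set(original) & set(appended))
print(reused_after_prepend, reused_after_append)
```

Appending leaves all 16 original blocks reusable, while prepending a single byte shifts every boundary so that none of the original block hashes reappear. Shrinking the block size does not help, because every block after the insertion point is still offset by one byte.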


411111111111111 1879 days ago

Technically speaking, you're wrong, but I'm sure the author doesn't want to reimplement block storage devices, so the spirit of the message is probably correct.


stereosteve 1879 days ago

Oh I'm not talking about disks... this is based on how SnowFS (the library for this project) splits up big files into chunks:

https://github.com/Snowtrack/SnowFS/blob/main/src/common.ts#...

The intent is a simple form of delta encoding; the hope is that many chunks will be common between two versions.


sebastian_io 1879 days ago

I should clarify this. The 100 MB window in SnowFS is currently unrelated to compression, as it is only used to check whether a block changed. Each block gets a hash. This is a fallback used for some file formats where the mtime timestamp cannot be trusted. Some files have a change in the first block (e.g. the first 100 MB), and that is faster to compare than an entire 8 GB file. But this window size is dynamic and can be changed and used for compression in the future.


stereosteve 1879 days ago

Ahh, this is my bad. For some reason I assumed the blocks were part of the storage scheme, but I see they are only used to compute the hash, and that the whole file is added to the zip. Sorry for the misunderstanding!
