
by rajatarya 1325 days ago

XetHub Co-founder here. Yes, one illustrative example of the difference is:

Imagine you have a 500MB file (lastmonth.csv) where every day 1MB is changed.

With file-based deduplication every day 500MB will be uploaded, and all clones of the repo will need to download 500MB.

With block-based deduplication, only around the 1MB that changed is uploaded and downloaded.
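To make the comparison above concrete, here is a minimal sketch that counts the bytes each approach would have to upload after an in-place edit. It is an illustration only, not XetHub's actual algorithm: the fixed 4 KiB block size, the function names, and the scaled-down file (500 KiB with 1 KiB changed instead of 500MB with 1MB changed) are all assumptions made for the example.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this sketch


def bytes_to_upload_file_level(old: bytes, new: bytes) -> int:
    """File-level dedup: if the file hash changed, the whole file is sent."""
    if hashlib.sha256(old).digest() == hashlib.sha256(new).digest():
        return 0
    return len(new)


def bytes_to_upload_block_level(old: bytes, new: bytes,
                                block_size: int = BLOCK_SIZE) -> int:
    """Block-level dedup: only blocks whose hash is not already stored are sent."""
    known = {
        hashlib.sha256(old[i:i + block_size]).digest()
        for i in range(0, len(old), block_size)
    }
    total = 0
    for i in range(0, len(new), block_size):
        block = new[i:i + block_size]
        if hashlib.sha256(block).digest() not in known:
            total += len(block)
    return total


def random_bytes(rng: random.Random, n: int) -> bytes:
    """Deterministic pseudo-random payload so the example is repeatable."""
    return rng.getrandbits(8 * n).to_bytes(n, "little")


rng = random.Random(42)
old_file = random_bytes(rng, 500 * 1024)

# Overwrite 1 KiB in place; the edit sits entirely inside one 4 KiB block.
edited = bytearray(old_file)
edited[100 * 1024:101 * 1024] = random_bytes(rng, 1024)
new_file = bytes(edited)

print(bytes_to_upload_file_level(old_file, new_file))   # whole file is re-sent
print(bytes_to_upload_block_level(old_file, new_file))  # only the touched block
```

Fixed-size blocks only line up like this when bytes are overwritten in place. An insertion shifts every later block boundary, which is why real tools tend to use variable, content-defined block boundaries instead.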


unqueued 1324 days ago

I combine git-annex with the bup special remote[1], which lets me still externalize big files while benefiting from block-level deduplication. Or, depending on your needs, you can just use a tool like bup[2] or borg directly. Bup actually uses the git pack file format and git metadata.

I actually wrote a script, which I'm happy to share, that makes this much easier and even lets you mount your bup repo over .git/annex/objects for direct access.

[1]: https://git-annex.branchable.com/walkthrough/using_bup/

[2]: https://github.com/bup/bup


AustinDev 1325 days ago

Have you tested this out with Unreal Engine blueprint files? If you all can do block-based diffing on those and on other binary assets, it'd be huge for game development.

I've had the misfortune of working with a couple of ~1TB repositories in Perforce in the past.


vvanders 1324 days ago

Last time I used Perforce in anger, it did pretty decently with a ~800GB repo (checkout + history).

I keep expecting someone to come along and dethrone it, but as far as I can tell it hasn't been done yet. The combination of specific filetree views, drop-in proxies, and a UI-forward, checkout-based workflow that works well with unmergeable binary assets still leaves Git LFS and other solutions in the dust.

+1 on testing this against a moderate-size gamedev repo. Those usually have some of the harder constraints: code and assets can be coupled, and the art portion of a sync can easily top a couple hundred GB.


AustinDev 1323 days ago

1TB of checkout is the kind of repo I'm talking about. I have two such repos checked out on this box currently. I'm not sure I've ever checked out a repo of this scale locally with history. I'd love to have the local history.


rajatarya 1325 days ago

Not yet. Would be happy to try - can you point me to a project to use?

Do you have a repo you could try us out with?

We have tried a couple of Unity projects (41% smaller due to deduplication) but not many Unreal projects yet.


AustinDev 1324 days ago

Most of my examples of that size are AAA game source that I can't share. However, I think this is an Unreal-based project using similar files, so it should show whether there is any benefit: https://github.com/CesiumGS/cesium-unreal-samples (where the .umap binaries have been updated). In this other example, the .uasset blueprints have been updated: https://github.com/renhaiyizhigou/Unreal-Blueprint-Project


civilized 1325 days ago

Does that work equally well whether the changes are primarily row-based or primarily column-based?


prirun 1325 days ago

HashBackup author here. Your question is (I think) about how well block-based dedup functions on a database, depending on whether rows or columns are changed. This answer describes how most block-based dedup software, including HashBackup, works.

Block-based dedup can be done either with fixed block sizes or variable block sizes. For a database with fixed page sizes, a fixed block size matching the page size is most efficient. For a database with variable page sizes, a variable block size will work better, assuming the dedup "chunking" algorithm is fine-grained enough to detect the database page size. For example, if the db used a 4-6K variable page size and the dedup algo used a 1M variable block size, it could not save a single modified db page on its own; it would instead save the roughly 200 db pages surrounding the modified page.
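The fixed versus variable distinction can be sketched in a few lines. This is not HashBackup's or XetHub's chunker: it is a simplified Gear-style rolling hash used as a stand-in, and the table seed, `avg_bits`, `min_size`, and `max_size` values are assumptions picked for the example. It inserts 10 bytes near the start of a buffer and counts how many chunks would have to be stored again under each scheme.

```python
import random

# Gear table: one pseudo-random 64-bit value per byte value (seeded for repeatability).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def fixed_chunks(data: bytes, size: int = 4096) -> list[bytes]:
    """Cut at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, avg_bits: int = 12,
               min_size: int = 1024, max_size: int = 16384) -> list[bytes]:
    """Cut where a rolling hash of the most recent bytes hits a target pattern."""
    chunks, start, h = [], 0, 0
    shift = 64 - avg_bits
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the 64-bit state, so the hash
        # depends only on roughly the last 64 bytes of content.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if (size >= min_size and (h >> shift) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def new_chunk_count(old_chunks: list[bytes], new_chunks: list[bytes]) -> int:
    """Number of chunks in the new version that are not already stored."""
    known = set(old_chunks)
    return sum(1 for c in new_chunks if c not in known)


data_rng = random.Random(1)
n = 256 * 1024
old = data_rng.getrandbits(8 * n).to_bytes(n, "little")
new = old[:1000] + b"0123456789" + old[1000:]  # insert 10 bytes near the start

fixed_new = new_chunk_count(fixed_chunks(old), fixed_chunks(new))
cdc_new = new_chunk_count(cdc_chunks(old), cdc_chunks(new))
print("fixed-size chunks to store again:", fixed_new)
print("content-defined chunks to store again:", cdc_new)
```

Because the cut points depend on content rather than offsets, the content-defined boundaries fall back into step shortly after the insertion, while every fixed-size block after the insertion point shifts and no longer matches.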

Your column vs row question depends on how the db stores data, whether key fields are changed, etc. The main dedup efficiency criteria are whether the changes are physically clustered together in the file or whether they are dispersed throughout the file, and how fine-grained the dedup block detection algorithm is.
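The clustering point can be checked with simple arithmetic. The sketch below assumes a hypothetical 1 MiB file, 100 single-byte changes, and fixed blocks. It treats a row edit as changes that sit next to each other, and a column edit in a row-oriented file as changes spread evenly through it; both layouts are assumptions for illustration, since the real layout depends on how the db or file format stores data.

```python
def dirty_blocks(changed_offsets: list[int], block_size: int) -> int:
    """Count the distinct blocks that contain at least one changed byte."""
    return len({offset // block_size for offset in changed_offsets})


FILE_SIZE = 1024 * 1024  # hypothetical 1 MiB file

# 100 changed bytes next to each other (for example, one edited row).
clustered = list(range(100))

# 100 changed bytes spread through the file (for example, one edited column
# in a row-oriented layout, touching every row). Stride is an assumption.
dispersed = [i * 10240 for i in range(100)]
assert max(dispersed) < FILE_SIZE

for block_size in (4 * 1024, 64 * 1024):
    for name, offsets in (("clustered", clustered), ("dispersed", dispersed)):
        blocks = dirty_blocks(offsets, block_size)
        print(f"{block_size // 1024:>2}K blocks, {name}: "
              f"{blocks} dirty blocks, {blocks * block_size} bytes to store")
```

With 4K blocks the clustered edit dirties 1 block and the dispersed edit dirties 100. With 64K blocks the dispersed edit dirties all 16 blocks, so the whole file is stored again even though only 100 bytes changed.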


rajatarya 1325 days ago

Yes, see this for more details of how XetHub deduplication works: https://xethub.com/assets/docs/xet-specifics/how-xet-dedupli...
