by bilekas 264 days ago
This is actually kind of cool. I've implemented my own version of this for my job, and it seems to matter when the numbers get tight. But if I remember correctly, for their case, wouldn't it have been easier to start from rsync?

> scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.

I haven't tried it myself, but doesn't this already meet that requirement? https://docs.rc.fas.harvard.edu/kb/rsync/

> Compression: If the SOURCE and DESTINATION are on different machines with fast CPUs, especially if they’re on different networks (e.g. your home computer and the FASRC cluster), it’s recommended to add the -z option to compress the data that’s transferred. This will cause more CPU to be used on both ends, but it is usually faster.

Maybe it's not fast enough, but it seems like a better place to start than scp, imo.

2 comments

> The remote diffing algorithm is based on CDC. In our tests, it is up to 30x faster than the one used in rsync (1500 MB/s vs 50 MB/s).
rsync in my experience is not optimized for a number of use cases.

Game development in particular often involves truly enormous assets, and enormous numbers of them. That is especially true for dev build iteration, where you're sometimes working with placeholder or unoptimized assets and things bloated with debug symbols. In my experience, rsync's copy speed scales poorly with large numbers of files. (In the past, I've used naive wrapper scripts that take a pregenerated list of the files on one side, use GNU parallel to partition the list into subsets, hand those to N different rsync jobs, and then run a sync pass at the end to clean up any deletions.)
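For anyone curious, the wrapper-script idea looks roughly like the sketch below. This is not the original script: it swaps GNU parallel for Python's `subprocess`, and the names `partition`, `rsync_command`, `parallel_sync`, and the `rsync-part-N.txt` list files are all made up for illustration. It assumes the pregenerated paths are relative to `src`, which is how rsync interprets `--files-from`.

```python
import subprocess
from pathlib import Path


def partition(paths, jobs):
    """Split the file list round-robin into at most `jobs` non-empty subsets."""
    buckets = [paths[i::jobs] for i in range(jobs)]
    return [b for b in buckets if b]


def rsync_command(list_file, src, dest):
    """Build one rsync invocation that copies only the files named in list_file."""
    return ["rsync", "-a", f"--files-from={list_file}", src, dest]


def parallel_sync(paths, src, dest, jobs=4, workdir="."):
    """Run one rsync per subset concurrently, then a final pass for deletions."""
    procs = []
    for n, subset in enumerate(partition(paths, jobs)):
        # Hypothetical list-file naming; any scratch location would do.
        list_file = Path(workdir) / f"rsync-part-{n}.txt"
        list_file.write_text("\n".join(subset) + "\n")
        procs.append(subprocess.Popen(rsync_command(str(list_file), src, dest)))

    # Wait for every job before deciding whether the batch succeeded.
    failed = [p.args for p in procs if p.wait() != 0]
    if failed:
        raise RuntimeError(f"rsync jobs failed: {failed}")

    # Single final pass: catches anything missed and removes deleted files.
    subprocess.run(["rsync", "-a", "--delete", src, dest], check=True)
```

Something like `parallel_sync(file_list, "build/", "devbox:/data/build/", jobs=8)` would be the call (host and paths are placeholders). The trailing slash on `src` matters for the final `--delete` pass, since it makes rsync sync the directory's contents rather than the directory itself.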

Just last week, I was trying to figure out a more effective way to scale copying a directory tree that was ~250k files varying in size between 128b and 100M, spread out across a complicatedly nested directory structure of 500k directories, because rsync would serialize badly around the cost of creating files and directories. After a few rounds of trying to do many-way rsync partitions, I finally just gave the directory to syncthing and let its pregenerated index and watching handle it.

Try this: https://alexsaveau.dev/blog/projects/performance/files/fuc/f...

> The key insight is that file operations in separate directories don’t (for the most part) interfere with each other, enabling parallel execution.

It really is magically fast.

EDIT: Sorry, that tool is only for local copies. I just remembered you're doing remote copies. Still worth keeping in mind.
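To make the quoted insight concrete, here is a minimal local-copy sketch that treats each directory as an independent task. It is only an illustration of the idea, not how the linked tool is implemented; `copy_dir_files` and `parallel_copy_tree` are hypothetical names, and a thread pool is just one assumed way to get the parallelism.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_dir_files(src_dir, dst_dir):
    """Copy the regular files directly inside src_dir into dst_dir."""
    # exist_ok=True tolerates another task having created the directory first.
    os.makedirs(dst_dir, exist_ok=True)
    count = 0
    with os.scandir(src_dir) as entries:
        for entry in entries:
            # Symlinks are skipped in this sketch; only regular files are copied.
            if entry.is_file(follow_symlinks=False):
                shutil.copy2(entry.path, os.path.join(dst_dir, entry.name))
                count += 1
    return count


def parallel_copy_tree(src, dst, workers=8):
    """Copy src to dst, submitting one task per directory to a thread pool."""
    tasks = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for dirpath, _dirnames, _filenames in os.walk(src):
            rel = os.path.relpath(dirpath, src)
            target = dst if rel == "." else os.path.join(dst, rel)
            tasks.append(pool.submit(copy_dir_files, dirpath, target))
    # result() re-raises any exception from a worker; the sum is files copied.
    return sum(t.result() for t in tasks)
```

Because no two tasks write files into the same directory, the workers never contend over individual file creations, which is the property the quote is pointing at. How much that helps in practice will depend on the filesystem and storage underneath.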