by mattzito 4118 days ago
It seems to me that your original idea was, in fact, the correct one - rsync probably would have been the best way to do this (and separately, a truck full of disks probably would have been the other best way).

First, rsync probably took too long because you used just one process and didn't optimize your command-line options. Most of the performance problems with rsync on large filesystem trees come from using one command to run everything, something like:

rsync -av /source/giant/tree/ /dest/giant/tree/

And the process of crawling, checksumming, and storing is not only generally slow but also incredibly inefficient on modern multicore processors.

Much better to break it up into many parallel rsync processes, something like:

rsync -av /source/giant/tree/subdir1/ /dest/giant/tree/subdir1/

rsync -av /source/giant/tree/subdir2/ /dest/giant/tree/subdir2/

rsync -av /source/giant/tree/subdir3/ /dest/giant/tree/subdir3/
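
For what it's worth, here is a minimal sketch of scripting that split, assuming the tree divides reasonably evenly across its top-level subdirectories. The function name sync_subdirs and the RSYNC override (handy for a dry run with echo) are made up for illustration and are not part of rsync:

```shell
# Launch one rsync per top-level subdirectory of $1, copying into $2,
# then wait for all of them to finish.
# Assumptions: hidden (dot) directories and files sitting directly in $1
# are not covered here and would need a separate pass.
sync_subdirs() {
    src_root=$1
    dst_root=$2
    for dir in "$src_root"/*/; do
        # If the glob matched nothing it stays literal; skip it.
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        # $dir ends in a slash, so rsync copies its contents into $dst_root/$name/.
        "${RSYNC:-rsync}" -a "$dir" "$dst_root/$name/" &
    done
    # Block until every background rsync has exited.
    wait
}
```

Usage would be something like `sync_subdirs /source/giant/tree /dest/giant/tree`. Note that this starts every job at once, which is fine for a handful of subdirectories but not for hundreds.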

That alone probably would have dramatically sped things up, BUT you do still have your speed of light issues.

This is where Amazon import/export comes in - do a one-time tar/rsync of your data to an external 9TB array, ship it to Amazon, have them import it to S3, then load it onto your local Amazon machines.

You now have two copies of your data - one on S3, and one on your Amazon machine.

Then you use your optimized rsync to run and bring it up to a relatively consistent state - i.e. it runs for 8 hours to sync up, now you're 8 hours behind.

Then you take a brief downtime and run the optimized rsync one more time, and now you have two fully consistent filesystems.

No need for drbd and all the rest of this - just rsync and an external array.

I've used this method to duplicate terabytes and terabytes of data around, and 10s of millions of small files. It works, and it has a lot fewer moving parts than drbd.

5 comments

Whenever someone gripes that rsync/scp is slow, it's usually because they didn't bother to look into proper solutions for their problem. rsync will barely fill the pipe in many/most cases. Using GridFTP or bbcp is usually preferred.
You can also do well relying on the OS scheduler and networking stack by forking off many rsync processes using GNU parallel or xargs.
Which is how I've done it in the past - I'm sure these days there are utilities that will do it for you, but I had a bunch of perl code that would fork off N workers out of a queue and, as one exited successfully, kick off another worker.

The issue with xargs back in the day was that you might need to run several hundred rsync processes, and suddenly launching 500+ processes in parallel made your server very very sad. So you needed some basic job queueing system.

$ man xargs

--max-args=max-args

-n max-args

Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the -x option is given, in which case xargs will exit.

...

--max-procs=max-procs

-P max-procs

Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the -n option with -P; otherwise chances are that only one exec will be done.
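
Putting those two flags together, a rough sketch of a bounded worker pool, assuming GNU find and xargs (-mindepth, -maxdepth, -print0, -0 and -P are extensions, not POSIX). The function name and the RSYNC override are illustrative only:

```shell
# Run one rsync per top-level subdirectory of $1 into $2,
# with at most $3 rsync processes alive at any time (default 4).
sync_subdirs_bounded() {
    src_root=$1
    dst_root=$2
    max_procs=${3:-4}
    # NUL-separated names survive spaces and newlines in directory names.
    find "$src_root" -mindepth 1 -maxdepth 1 -type d -print0 |
        # -n 1 hands each worker exactly one directory; -P caps the workers.
        # Inside sh -c: $1 = rsync command, $2 = destination root, $3 = directory.
        xargs -0 -n 1 -P "$max_procs" sh -c '
            name=$(basename "$3")
            "$1" -a "$3/" "$2/$name/"
        ' sh "${RSYNC:-rsync}" "$dst_root"
}
```

xargs starts the next job as soon as a worker exits, so this behaves like a simple job queue, and it returns a non-zero status if any of the rsync runs failed.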

This was also...14?-ish years ago. Sometimes on Solaris 2.6 or 8 boxes. I am pretty sure xargs didn't have that flag back then (which is why I said, "back in the day").
Thanks for the suggestions. Amazon import is great, but we were informed they needed 3 weeks to perform the import, and we didn't have time for that.
> we were able to move a 9TB filesystem to a different data center and hosting provider in three weeks

But it took you 3 weeks anyway?

In the end it did, we hoped it would be done sooner :) So we wrote the article to make sure other people that need it done fast can do so.
The other trick is to point inotify at the partition where everything is mounted so you have a list of every file that is changed [1].

That way instead of scanning the whole file system you just rsync the files that have changed.

[1] Assuming you can't hook into the app(s) making the changes directly. You can even just look for new/changed files if deletions are not a priority.
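
A rough sketch of that idea, assuming inotifywait from inotify-tools is installed and that no filename contains a newline. The function names are hypothetical, and on a tree with a very large number of directories the recursive watch may need the fs.inotify.max_user_watches limit raised:

```shell
# Append the absolute path of every file written, created, or moved into
# the tree under $1 to the log file $2. Runs until interrupted.
watch_changes() {
    inotifywait -m -r -q -e close_write,create,moved_to \
        --format '%w%f' "$1" >> "$2"
}

# Read absolute paths on stdin and print the unique ones that live under $1,
# rewritten relative to $1 (the form rsync --files-from expects).
changed_files_list() {
    prefix=${1%/}/
    awk -v prefix="$prefix" \
        'index($0, prefix) == 1 { print substr($0, length(prefix) + 1) }' |
        sort -u
}

# Copy only the logged files from $1 to $2, using the log file $3.
sync_changed() {
    changed_files_list "$1" < "$3" |
        rsync -a --files-from=- "${1%/}/" "${2%/}/"
}
```

Deletions are ignored here, as the footnote allows. Files that were logged and then removed before the sync will make rsync report errors for those paths, so treat its exit status accordingly.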

Would that many rsyncs on the same volume thrash the disk like crazy?
Yes, it likely would - most disks will top out on read i/o rate, especially on somewhat random access.

rsync in general is quite optimized, and usually the limiting factor in a data transfer is the network or disk rather than CPU speed (unless crypto is involved, for example, over an ssh connection).

> (and separately, a truck full of disks probably would have been the other best way).

A truck? I think two 3.5" 5TB hard drives would be enough, no?