Ask HN: How do you handle transferring large files over the internet?
8 points by lucasch 3631 days ago
I've often heard that the best way to send large amounts of data is to actually send it via snail mail. But if that is not an option, what is the best way to send, say, 500 GB over the internet? Past experience preferred, but creative solutions welcome!
9 comments

If, for some reason, you are averse to rsync (there are plenty of Windows clients), then BitTorrent is second-best: it works just as well for transferring private data. Just turn off all the metadata broadcast options and make sure that encryption is required. (Not so easy for incremental transfers, though.)
We considered rsync but were wondering if there were more specialized tools available. We figured that those who work in the scientific community would have a way to transfer their large data sets between institutions.
We transfer large files containing raw radar data, and moderately sized files containing databases of target movements and track information.

We use rsync.

When I worked for the local university we had to transfer data between machines to run experimental parallel programs on so-called "big data."

We used rsync.

Ahh ok, that's good to hear. Have you considered using multipath transport protocols with something like rsync? I am curious if it could benefit this situation. MPTCP sounds like an interesting protocol if you control both hosts. https://www.multipath-tcp.org/
We were/are always restricted by intermediate limits on throughput, so it's never been useful or interesting to consider alternatives.

YMMV, but if you want to improve throughput, consider carefully where your data has to go through. But rsync is rock-solid, well-understood, mature, and just does exactly what it is intended to do.

If you have a high-bandwidth link and are in a hurry, use GridFTP (http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp...), otherwise just use rsync.

Scientific institutions that need to transfer large data sets have fast connections. :) How does 340 Gbps sound? Check out ESnet. http://newscenter.lbl.gov/2014/10/20/does-high-speed-network...

I heard about ESnet while consulting at Lawrence Berkeley National Laboratory.

I would use scp with the -C option to compress the data for the first attempt to transfer the file. Any subsequent attempts to transfer the file should use rsync (this includes any updates to the file).

Another option is to use a cloud data storage provider (Dropbox, Box, Google Drive, etc) and install their software that keeps files in sync. Then you can just put the file in a local folder and let their software sync it to the cloud. After sync is complete, you email a link to share the file. Of course, their software is probably just a GUI for rsync.

Why not use rsync for the first transfer as well? There's `-z` for the same effect...

As for commercial providers - I've seen rate limits (never did it saturate the link) and size limits (0.5 TB will cost you extra). Moreover, the data will go through an intermediate hop, which is slow (Dropbox starts downloading only when upload completes, doubling your transfer time), plus your data is now...somewhere (which could be a regulatory issue, depending on the data).

I assume you're transferring between two hosts you both control. Some sane options are:

- rsync

- SFTP-over-SSH, or SFTP over a dedicated VPN

- private, encrypted torrents

SFTP gets tricky on resume; rsync and BT have built-in data integrity checks.
Aspera is often used in the film industry to move large video files around. My understanding is their software takes over some layers of the network stack to make sure the pipe stays saturated.

http://asperasoft.com/

Yep, when I worked in the film industry, we used Aspera or GridFTP (open source).
Do you need to transmit 500GB every time, or just a diff from a previous dataset? If it's the latter, using the send/receive functionality of a file system with snapshots and incremental backup (ZFS, Btrfs, etc.) can be significantly faster than using pure rsync. Rsync would need to scan the complete 500GB of data to find the blocks to send, while send/receive can compute the diff much faster.
Negligible W/R/T transfer time. While using native capabilities of ZFS is awesome, you're now locked into a particular FS at both sides of the transfer (this may or may not be an issue).

(Also, there's a patch for rsync that allows you to force computing the checksum in batch, not on each invocation; but that's getting into hairy territory that's rarely needed - I used it in exactly one case so far)
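
For anyone curious what the snapshot send/receive approach discussed above looks like in practice, here is a minimal sketch for ZFS. The dataset `tank/data`, the snapshot names `monday` and `tuesday`, and the host `backup-host` are all made up for illustration, and the receiving side is assumed to already hold the earlier snapshot. By default the function only prints the commands it would run.

```shell
#!/bin/bash
# Sketch: incremental replication between two ZFS hosts.
# All names below (tank/data, monday, tuesday, backup-host) are hypothetical.
# DRY_RUN=1 (the default) prints the commands instead of executing them.

DRY_RUN="${DRY_RUN:-1}"

zfs_incremental_send() {
    local dataset="$1" prev="$2" curr="$3" host="$4"
    if [ "$DRY_RUN" = "1" ]; then
        echo "zfs snapshot ${dataset}@${curr}"
        echo "zfs send -i ${dataset}@${prev} ${dataset}@${curr} | ssh ${host} zfs receive ${dataset}"
    else
        # Take the new snapshot, then stream only the blocks that changed
        # since the previous snapshot to the remote host.
        zfs snapshot "${dataset}@${curr}" &&
        zfs send -i "${dataset}@${prev}" "${dataset}@${curr}" | ssh "$host" zfs receive "$dataset"
    fi
}

zfs_incremental_send tank/data monday tuesday backup-host
```

Running it with `DRY_RUN=0` would execute the pipeline for real, which needs ZFS on both ends and SSH access to the receiving host.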

More details please.

* Are we talking a few giant files or thousands of small-ish files?

* Who is on the receiving end? Technical people whom we can expect to be able to run some command-line stuff, or your average joe?

* In what scenarios must the solution work? What OSes? Is it acceptable to install extra software or must it work out of the box?

etc. etc. etc.

rsync
Indeed. I've been trying all sorts of weird stuff, but this takes the cake. Ubiquitous, rock-solid, sane. Plus, no worrying "is it done yet? Do I have the latest version?" Just let it run (again) - this makes it rather foolproof.
Does rsync auto-resume after failed connections?
I've written a script that every minute tests to see if the appropriate rsync command is running. If not, it simply runs it again. In that way it effectively restarts and gracefully resumes.

This Google search:

https://www.google.co.uk/search?q=rsync+auto-restart+failed+...

returns this link:

http://superuser.com/questions/302842/resume-rsync-over-ssh-...

which contains this script:

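    #!/bin/bash
    # Re-run rsync until it exits successfully.
    # "source" and "dest" are placeholders for the real paths.

    while true
    do
        # --partial keeps partially transferred files for the next run.
        if rsync -avz --partial source dest ; then
            echo "rsync completed normally"
            exit 0
        else
            echo "Rsync failure. Backing off and retrying..."
            sleep 180
        fi
    done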
The comment says:

    When the connection dies, rsync will quit
    with a non-zero exit code. This script simply
    keeps re-running rsync, letting it continue
    until the synchronisation completes normally.
That's pretty much what I've done.
Wow, thanks for the script. Surprising in its simplicity. I would have thought this use-case was popular enough to warrant specialized tools etc. Especially in the scientific community where they transfer large files.
It's simple enough that it's the sort of thing I type out in 30 seconds and there it is. No need for specialist tools - finding the tool, remembering how to use it, working out the right parameters ...

Easier, faster, and more flexible just to write the script. It's what I do.
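
A possible variation on the retry loop quoted above, as a sketch only: wrap the loop in a function that gives up after a fixed number of attempts and doubles the wait each time. The function name `retry`, its arguments, and the example paths are illustrative choices, not something from the linked answer.

```shell
#!/bin/bash
# retry: run a command until it succeeds, doubling the wait between attempts.
# Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# (Hypothetical helper; the name and argument order are illustrative.)
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    while true; do
        # Run the command; a zero exit status means we are done.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts" >&2
            return 1
        fi
        echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
}

# Example with made-up source and destination:
# retry 10 30 rsync -avz --partial /data/ user@example.org:/data/
```

The cap on attempts means a permanently broken destination eventually surfaces as an error instead of looping forever, which matters if the script runs unattended.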

Yes, it does, if you use:

   --partial               keep partially transferred files

   --append                append data onto shorter files
You mean, if the connection fails before completing? I don't think so (I believe that's a feature).

For automated transfers ("retry until done"), I use the lsyncd wrapper (which also watches the source files, so you don't need to poll for changes: it wakes up by inotify).

Sure, just use --partial switch:

   --partial               keep partially transferred files
That is incorrect. This will keep partial transfers in the destination (and when invoked again, will continue transferring from that point onwards), but will not restart an aborted transfer (e.g. for a broken connection).

In other words, this feature is a prerequisite for auto-resume, not auto-resume itself (which can be scripted in ~10 lines, as shown in a sibling thread).

Thanks! But no need to script, you can just use

            --append                append data onto shorter files
to resume interrupted transfers.
btw not relevant in your case, but a friend working for a large video company regularly drives a truckload of 10 TB tapes around the continent - bandwidth is still an issue at that scale.
Amazing. If only there was a way to get around the limitations at that level.
Well...two decades ago, 500 GB of data would have been moved as freight, too. Data expands to fill any available capacity - I don't think there is any way around that.
About 9 years ago, while working as a system administration consultant, I had a gig to fly a portable hard drive with about 360 GB from LA to St. Louis as part of a migration of a web application. It was faster than the network connections available to my client at the time. I remember calculating the throughput...

I asked, "Why don't you just FedEx it?" "It's too important," the client said, "and we know and trust you."

It was funny, I had it in a laptop bag, and didn't let go of it except to go through the security scanner... the only thing missing was the handcuff connecting it to my wrist. :)
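
The throughput calculation mentioned above is easy to reproduce. This is a rough sketch with assumed numbers: the 8-hour door-to-door time and the 100 Mbit/s link are guesses for illustration, not figures from the story, and decimal units are used (1 GB = 8000 megabits).

```shell
#!/bin/bash
# Back-of-the-envelope transfer arithmetic using integer shell maths.
# Decimal units are assumed: 1 GB = 8000 megabits.

# Effective throughput in Mbit/s of moving SIZE_GB in HOURS (e.g. by courier).
effective_mbps() {
    local size_gb="$1" hours="$2"
    echo $(( size_gb * 8000 / (hours * 3600) ))
}

# Seconds needed to push SIZE_GB through a link sustaining LINK_MBPS.
transfer_seconds() {
    local size_gb="$1" link_mbps="$2"
    echo $(( size_gb * 8000 / link_mbps ))
}

# 360 GB carried door to door in an assumed 8 hours:
effective_mbps 360 8        # prints 100
# 500 GB over an assumed sustained 100 Mbit/s link:
transfer_seconds 500 100    # prints 40000 (about 11 hours)
```

Under those assumptions the hand-carried drive works out to roughly a sustained 100 Mbit/s link, which is why the flight could beat the network.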

Well...you are probably not going to throw a hard drive at a client's door and run. A delivery guy might, as it's just another cardboard package, not priceless data (might be easier now with SSDs).

Indeed, getting a trustworthy courier service is so hard that actually sending an in-house employee is worthwhile, even though their hourly rates make this extremely expensive: you are removing tens of abstraction layers, while preserving a high degree of control ("Oh, we might have run it over with a truck. And accidentally put it on a plane to New Zealand on hop #3. And they can't seem to find it there.")

Fair enough. :)