S3sync – Tool belt for managing your S3 buckets (github.com)
40 points by clarete 4650 days ago
I started contributing to s3sync before hearing about the official Amazon S3 tool written in Python. However, after testing the official tool for a couple of minutes, I decided to spend more time working on s3sync, and here's the result.

Thanks to Michael Grosser for his support and patches!

8 comments

How does this compare to the existing s3cmd [1] tool? One issue I've had with s3cmd is it tends to use a lot of memory when syncing two large buckets. I'd love to see a tool that was faster / more memory efficient than s3cmd. Would be awesome to see some benchmarks and feature comparison in the README!

[1] http://s3tools.org/s3cmd

Agreed. We support s3cmd:

  ssh user@rsync.net s3cmd get s3://rsynctest/mscdex.exe

... and have for some time ... but we'd be happy to have some diversity there. One thing that sticks out immediately, of course, is that s3cmd is written in Python and s3sync is a Ruby gem. This makes things difficult since we insist on having no interpreters in our customer chroot...[1]

[1] We "freeze" things like s3cmd and rdiff-backup into binary executables so they don't require python in the environment

So, basically, the main difference is the language they're implemented in. I started working on s3sync because I found its code easier to read and I really needed to practice my Ruby skills.

Although s3cmd is older and has more features, s3sync was designed to grow into something stable and well tested. I'm definitely planning to write benchmarks for s3sync and improve its performance as much as I can.

Thank you. Does this support IAM Roles?
Not currently! Please feel free to open a ticket about that! Thank you! :)
Ok here you go.

https://github.com/clarete/s3sync/issues/15

Wish I could help, but I don't know Ruby :)

The speed of https://github.com/cobbzilla/s3s3mirror is quite impressive, I suggest you give it a try.
A tool that could do parallel downloads of small files would be a winner. We have a lot of small files and use s3cmd. It downloads one at a time. You can finagle it with some xargs magic, but it would be nice to have it built in.
I wrote a simple tool for syncing an S3 bucket to local disk:

https://github.com/newspaperclub/hank

I found the existing s3 syncing tools used lots of memory when dealing with many small files, so I wrote this in Go, trying to carefully manage memory usage by concurrently listing and downloading the files.

We use it for backing up 500GB across 800k files, from an S3 bucket to a ZFS filesystem for snapshotting. It does the job well, usually taking just a few minutes when there are minimal changes.
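
The idea described above (listing and downloading at the same time, so the full key list never sits in memory) can be sketched roughly like this. This is not hank's actual Go code: `list_keys` and `fetch` are hypothetical stand-ins for the S3 list and get calls, and the page size, worker count, and queue limit are arbitrary assumptions.

```python
import queue
import threading


def list_keys(total, page_size):
    """Hypothetical stand-in for a paginated bucket listing: yields pages of keys."""
    for start in range(0, total, page_size):
        stop = min(start + page_size, total)
        yield [f"file-{i:06d}" for i in range(start, stop)]


def fetch(key):
    """Hypothetical stand-in for downloading one object; returns fake bytes."""
    return key.encode()


def mirror(total_keys, page_size=100, workers=8, max_pending=200):
    # Bounded queue: the lister blocks once max_pending keys are waiting,
    # so memory use does not grow with the size of the bucket.
    pending = queue.Queue(maxsize=max_pending)
    done = []
    done_lock = threading.Lock()

    def worker():
        while True:
            key = pending.get()
            if key is None:  # sentinel: no more work for this worker
                return
            data = fetch(key)
            with done_lock:
                done.append((key, len(data)))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    # Listing runs while the workers are already downloading earlier pages.
    for page in list_keys(total_keys, page_size):
        for key in page:
            pending.put(key)

    # One sentinel per worker, then wait for all of them to finish.
    for _ in threads:
        pending.put(None)
    for t in threads:
        t.join()
    return done
```

Calling `mirror(1000)` processes every key once while holding at most `max_pending` keys in the queue at any moment. A real tool would also need retries and a check for which files actually changed.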

Totally expecting to see someone build this in Go in a couple of weeks to do exactly that sort of thing. :)

Wish I had the time, I just realized it would be a fun project to learn Go with.

You could just modify s3cmd to use threads during uploads? The problem is I/O bound, so the GIL isn't an issue, and you wouldn't have to go through the trouble of implementing S3's auth header.
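
A minimal sketch of why threads help here, assuming each transfer spends most of its time waiting on the network. This is not s3cmd code: `upload` is a hypothetical stand-in that sleeps instead of talking to S3, and sleeping releases the GIL in the same way a blocking socket call does.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def upload(name, latency=0.05):
    """Hypothetical stand-in for one S3 PUT; the sleep models network wait."""
    time.sleep(latency)
    return name


def upload_all(names, threads):
    """Upload every name with a pool of `threads` workers; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order even though the calls overlap.
        uploaded = list(pool.map(upload, names))
    return uploaded, time.perf_counter() - start


if __name__ == "__main__":
    names = [f"small-{i}.txt" for i in range(20)]
    _, serial = upload_all(names, threads=1)
    _, parallel = upload_all(names, threads=10)
    print(f"1 thread: {serial:.2f}s, 10 threads: {parallel:.2f}s")
```

With one thread the waits add up one after another; with ten they overlap, so the total time drops even though only one thread runs Python bytecode at a time.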
I use this tool extensively and it's awesome: http://sprightlysoft.com/s3sync/ It has the same name. Check out the -TransferThreads parameter for parallel uploads/downloads (see the Documentation link). I use it for syncing dozens of buckets, each containing tens of thousands of small files, and it works without a glitch.
This is handy and extends S3's use case by making two-way sync straightforward. My long-term storage is in Amazon Glacier; I wonder how much effort it would take to extend this and make a straightforward process to pull the data from Glacier, sync with S3 and then push back to Glacier.
What does this provide that isn't already provided by the official AWS CLI?

http://docs.aws.amazon.com/cli/latest/reference/s3/index.htm...

Hi, thanks for asking! I definitely tried it before putting more effort into s3sync. Unfortunately, its CLI experience is really poor and its error reporting didn't help me understand why synchronizing my stuff with S3 was not working.

At the end of the day, I had an unreadable file inside the directory that I was trying to back up. I found that out using the `--debug` option of the official client, but I couldn't actually continue copying the files because of that error.

When I tried with s3sync, as I expected, it just yielded a warning about the single file I had with problems and kept working until my backup was done!
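
The warn-and-continue behavior described above can be sketched like this. It is only an illustration in Python, not s3sync's actual Ruby code: `backup` and its `send` callback are hypothetical names, and `send` stands in for whatever uploads the bytes.

```python
import sys


def backup(paths, send):
    """Pass each readable file's bytes to `send`; warn about and skip the rest.

    Returns the list of paths that were skipped.
    """
    skipped = []
    for path in paths:
        try:
            with open(path, "rb") as fh:
                data = fh.read()
        except OSError as exc:
            # Unreadable, missing, or otherwise inaccessible file:
            # report it and move on instead of aborting the whole run.
            print(f"warning: skipping {path}: {exc}", file=sys.stderr)
            skipped.append(path)
            continue
        send(path, data)
    return skipped
```

One bad file then costs a warning line on stderr rather than the whole backup, and the returned list makes it easy to retry or report the skipped paths afterwards.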

Sorry for the wall of text, I just think it's funny because this exact question came to my mind a couple of days ago and that's how I answered myself! :)

Have you tried jets3t? (http://www.jets3t.org/) It's got pretty comprehensive tools for managing S3 and has been rock solid for me so far.
You should check out this suite of S3 cmd line tools too - https://github.com/aboisvert/s3cp
What are the pros and cons of using s3sync versus s3cmd?
First things that come to my mind:

s3cmd:
* might be considered more robust and battle tested, as people already commented here;
* is written in Python, so if you have a Python environment running, you might be more comfortable with it.

s3sync:
* Smaller codebase: might be easier to keep things simple and well tested, achieving the same stability with less time/effort.
* Error reporting: one of my main reasons to keep working on s3sync was its better error reporting. I just go crazy when something goes wrong and I don't know why.
* It's in Ruby: if you have a Ruby environment and you don't want to add any Python dependencies, you might choose it.

s3cmd is battle tested and hasn't had any memory issues for me. (100 buckets, ~40GB each bucket)