Duplicacy: Lock-free deduplication cloud backup tool, with “fair source” license (github.com)
40 points by acrosync 3298 days ago
15 comments

Reported to Github, as Commercial software masquerading as various open free license projects (MIT, GPL, BSD, etc.).

Also, intentional namespace pollution with existing backup tool, which IS gpl'ed.

Not cool. Not cool at all.

____________________________________

(response, since I'm submitting 'too fast'... ):

Github has commercial repos, and private repos.

It's pretty simple, really. If you want the free options on GH, you choose from a list of standard Open Source licenses. https://github.com/blog/1964-open-source-license-usage-on-gi...

You're also asked to create a LICENSE file to go along with this.

Their license, however, is very much NON-FREE. As in, if I click clone, since I work for an employer of 50k people, I'm in violation. Full stop. And we're not even talking about developing on it, or submitting PR's, or what have you. This is simple copy which puts me in violation.

It's very much against the spirit of GitHub, and probably against the license on GH as well.

And it also is attempting to dilute another project that does similarly. Just so happens they're 2 letters different. Duplicacy vs Duplicity. That's an asshole thing to do.

Here's a few names I just devised: ClouDuplicate , Clouder, DupliCloud, CfC (cloud file cloud)..

Instead, it's very uncool to try to pollute an existing namespace of the same thing. Talking about pro-level bad will here.

The name and the license are two separate issues.

I agree the name is confusing, however this is not intentional. As I explained in the other comment, I chose duplicacy because the domain name was available and this is a very good name for a backup tool (even better than duplicity).

I chose this fair source license because this is basically the only free-for-personal-use license. Many people here ask why I didn't go with a free license like GPL. Here is why. I believe software should be free for personal users, but I don't like for-profit companies using it for free. This software can potentially help companies solve a painful everyday problem (and therefore make more money) and yet there isn't a license to require them to pay if they don't distribute the software. In my opinion, this is extremely unfair to independent developers like me.

The naming clash is unfortunate, but will at some point become inevitable as more tools doing the same thing (de-duplicating backup in this case) are created.

GitHub does not restrict licensing on their public repositories; I'm not interested in declaring myself a shaman for the "spirit of GitHub" to address that point.

http://www.infoworld.com/article/2615869/open-source-softwar...

I do agree with this part of your post:

Not cool. Not cool at all.

Either naivety or guerilla growth hacking / marketing; we'll see how things shake out.

> Their license, however, is very much NON-FREE. As in, if I click clone, since I work for an employer of 50k people, I'm in violation. Full stop. And we're not even talking about developing on it, or submitting PR's, or what have you. This is simple copy which puts me in violation.

I'm not a lawyer, but the user cap would seem to apply to "use" of the software, not simple copying.

https://fair.io/

I wonder why someone/some org would opt for a weird non-standard license over a GPL/Commercial dual-license as is standard and court-tested... or even better try to monetise by offering support/consulting or infrastructure.
IIRC, GitHub doesn't actually impose any restrictions on what license you choose. Their ToS does grant GitHub the right to redistribute your code (for obvious reasons), among other things around GitHub's fork-and-PR model, beyond whatever licensing terms you've chosen. Valve, for example, publishes portions of their Source Engine on GitHub with a non-free license, and I'm pretty sure there are GitHub repos with CC-BY-ND-NC-whatever licenses for various non-code assets.

I don't think this is against the spirit of GitHub, either. I ain't GitHub, though, so that opinion is by no means authoritative.

All this is different from, say, SourceForge, where using SourceForge to host your code did (does?) require licensing your code under a FOSS license.

----

Regardless, still scummy to take a name so close to an existing actually-FOSS project with similar goals. Additionally scummy to call the license "fair" (if it ain't free, it ain't fair), though that's probably not the developer's fault.

> Here's a few names I just devised: ClouDuplicate , Clouder, DupliCloud, CfC (cloud file cloud)..

Or you know, since it's written in Go, how about GoDuplicate, GoBackup, etc.

I like giving people the benefit of the doubt, but it's just so similar and they have so many obvious pun options that even the most uncreative person probably would have come up with a more unique name.

"bup - I can't believe no-one had this idea earlier!"
Wait, how's it masquerading as what?
The name is very similar to the GPL software Duplicity, and the license is not a free software license.
While I'd agree it's similar, it's not the same.

Public projects on GitHub do not need to be under an open source license - there is absolutely no requirement for that anywhere.

Right, I was just explaining the "masquerading".
It's a derivative work of, at least, an LGPL[0] and Apache2[1] project, but does not include any copyright notices or attribution at all.

[0] - https://github.com/gilbertchen/goamz/blob/master/LICENSE

[1] - https://github.com/gilbertchen/azure-sdk-for-go/blob/master/...

Wait, I thought this submission was for Duplicati (https://www.duplicati.com/), which I used (very happily I might add) in the past for cloud backups of my server to S3.
Note that the "fair source" license is a proprietary software license that happens to sound like a free software license.
Some claims:

"It is the only cloud backup tool that allows multiple computers to back up to the same storage simultaneously without using any locks (thus readily amenable to various cloud storage services)"

"What is novel about lock-free deduplication is the absence of a centralized indexing database for tracking all existing chunks and for determining which chunks are not needed any more. Instead, to check if a chunk has already been uploaded before, one can just perform a file lookup via the file storage API using the file name derived from the hash of the chunk."

Tahoe-LAFS's immutable file model (based on convergent encryption) was capable of doing this same thing a decade ago, and also features a pretty nifty capability-based security model:

https://tahoe-lafs.org/trac/tahoe-lafs

Naming chunks by their hashes is not a new idea, but this technique alone does not give you a practical backup tool. The deletion of unreferenced chunks becomes a hard problem, and the centerpiece of lock-free deduplication is the two-step fossil collection algorithm that solves this hard problem.
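
For anyone wondering what a two-step collection could look like, here is a rough in-memory sketch. It reflects my reading of the idea (first set unreferenced chunks aside, then delete them only once newer backups prove nobody needs them) and is not Duplicacy's actual code. The Store class and both function names are hypothetical, and the real rules for when step 2 may run are more involved.

```python
class Store:
    """In-memory stand-in for a storage backend (hypothetical, illustration only)."""

    def __init__(self):
        self.chunks = {}     # chunk name -> bytes, visible to backups
        self.fossils = {}    # chunk name -> bytes, set aside pending deletion
        self.snapshots = {}  # (client, revision) -> set of chunk names


def collect_fossils(store, keep):
    """Step 1: move chunks not referenced by any kept snapshot into fossils."""
    referenced = set()
    for key in keep:
        referenced |= store.snapshots[key]
    # Build the list first so the dict is not mutated while iterating.
    collected = [name for name in store.chunks if name not in referenced]
    for name in collected:
        store.fossils[name] = store.chunks.pop(name)
    return collected


def delete_fossils(store, collected, new_snapshots):
    """Step 2: delete fossils, restoring any that a newer snapshot still uses.

    Assumes the caller runs this only after every client has finished a
    backup that started after step 1.
    """
    still_needed = set()
    for key in new_snapshots:
        still_needed |= store.snapshots[key]
    for name in collected:
        data = store.fossils.pop(name)
        if name in still_needed:
            store.chunks[name] = data  # a concurrent backup referenced it
```

The point of the intermediate fossil state is that a backup racing with step 1 can never lose data: the worst case is that a chunk gets restored in step 2.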
Tahoe-LAFS supports a mark/sweep-style garbage collection algorithm
Developer here. Duplicacy is built on the concept of Lock-Free Deduplication (https://github.com/gilbertchen/duplicacy/blob/master/DESIGN....), which allows it to back up multiple computers to the same storage without using any locks. Currently it supports local or networked drives, SFTP servers, Amazon S3, Backblaze B2, Microsoft Azure, Google Cloud Storage, Google Drive, OneDrive, Dropbox, and Hubic.

I recently released the source code under the Fair Source 5 License (https://fair.io/), which means it is free for individuals or businesses with fewer than 5 users. Otherwise the license costs only $20 per user/year.

Questions and suggestions are welcome.

From the link to the license:

> Fair Source has the power to promote diversity within the developer community. To date, contributing to open source has been an expensive proposition for developers. You have to have a stable income and a lot of extra time to work on side projects for free, which means talented developers from underprivileged backgrounds often aren’t able to contribute. Fair Source allows developers to monetize their side projects, which means more people can afford to join the ranks of developers who pursue these initiatives.

I find it funny that some people feel a need to justify charging money for something by coming up with bogus social justice rationalizations.

I agree that people don't need a justification to charge money, but the rationale isn't bogus—it is hard to contribute without a stable income, underrepresented groups in tech tend to make less money in general, and getting paid could help that.

I'm not sure this license is the way to go, though. Unusual licenses tend to turn people off, and it's not clear how profits from this license would go to contributors.

Completely disagree that the reasoning is bogus.

I object to the name. It is clearly an attempt to rebrand proprietary licensing by capitalizing on associations with open source. Trademark law of course doesn't apply, and I wouldn't want it to if it did, but this is pretty much the definition of causing confusion in the marketplace.

There are two features that I need that are difficult to set up with most backup systems.

    1. Does this support encryption? 
    2. Can this do one-command restore of files to a previous revision or day?
Yes and Yes.

Duplicacy follows the git/hg command model. To initialize the repository (the directory to be backed up), run the init command:

  duplicacy init repository_id storage_url -e
The -e option turns on the encryption.

To back up:

  duplicacy backup
To restore:

  duplicacy restore -r revision_number [files]
Does Duplicacy use the filesystem events APIs like FSEvents on macOS to minimize the need for a full scan on every backup? CrashPlan has a lot of issues, but this is one of its best features.

Also, how do you see this as being different from Duplicati and Arq?

Duplicacy doesn't use the filesystem events APIs. Instead, it checks the timestamps and sizes to identify modified files and by default backs up only those files.
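
To make the timestamp-and-size check concrete, here is a minimal sketch of how such a scan could work. It is not taken from Duplicacy's source; the function name changed_files and the shape of the index (relative path mapped to a size and mtime pair) are assumptions for illustration.

```python
import os


def changed_files(root, previous):
    """Return (changed, current) for the tree under `root`.

    `previous` maps relative path -> (size, mtime_ns) from the last scan
    (hypothetical index format). A file counts as changed when it is new
    or when its size or modification time differs.
    """
    changed = []
    current = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            rel = os.path.relpath(path, root)
            current[rel] = (st.st_size, st.st_mtime_ns)
            if previous.get(rel) != current[rel]:
                changed.append(rel)
    return sorted(changed), current
```

The trade-off against an events API is that every backup still walks the whole tree and stats every file, but it never has to read the contents of unchanged files.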

The main use case supported by Duplicacy but not any others including Duplicati and Arq is backing up multiple clients to the same storage while still taking advantage of cross-client deduplication. This is because Duplicacy saves each chunk as an individual file using its hash as the file name (as opposed to using a chunk database to maintain the mapping between chunks and actual files), so no locking is required with multiple clients. Another implication is that the lock-free implementation is actually simpler without the chunk database and thus less error-prone.
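
As a rough illustration of the "hash as file name" idea, the sketch below stores each chunk as its own file in a local directory that stands in for the storage backend. SHA-256 and the names chunk_name and upload_chunk are assumptions for the example, not Duplicacy's actual hash or API.

```python
import hashlib
import os
import tempfile


def chunk_name(data):
    """Derive the file name from the chunk content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def upload_chunk(storage_dir, data):
    """Store `data` under its hash; return True only if it was newly written."""
    path = os.path.join(storage_dir, chunk_name(data))
    if os.path.exists(path):
        # The existence check replaces a lookup in a central chunk database.
        return False
    # Write to a uniquely named temp file, then rename into place, so that
    # two clients racing on the same chunk never expose a partial file.
    with tempfile.NamedTemporaryFile(dir=storage_dir, delete=False) as tmp:
        tmp.write(data)
    os.replace(tmp.name, path)
    return True
```

If two clients do race on the same chunk, both write identical bytes to the same final name, so the loser simply overwrites the file with the same content and no lock is needed.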

One of our users wrote a long post (https://duplicacy.com/issue?id=5651874166341632) comparing Duplicacy with other tools including Arq, based on his experience. I also added a comment to that thread comparing Duplicacy with Arq based on my read of their documentation.

In the 'Comparison' section of your readme, can you add in an entry for NetApp's Altavault?
The name is too similar to Duplicity; do you mind renaming?
I didn't want to sound like duplicity intentionally, but duplicacy.com was still available at that time and I thought it was a perfect name for a backup tool...
For what it's worth, I thought this was Duplicity until I read the comment even after I did a quick glance at the github repo. Since you're both in backup, this is going to be very confusing to people.
Same here.

Edit: Come to think of it, it seems quite funny how many backup solutions are named akin to this, while, under the hood, they actually go to great lengths to get rid of duplicates. Maybe a name deriving from "condensing" or "shelving" would be more accurate? ;-)

Yup fooled me too. Even downloaded Duplicity
And too similar to duplicati
I think I understand the goals of the "fair source" license but why not make it copyleft all around, and just sell hosted version to small biz, and license exceptions to corporate clients?
Nobody needs the exception so they won't buy it. That business model only works for copyleft libraries.
Depends, if it's something likely to be customized, AGPL might work.
The verdict of the "open source competition" in Duplicacy's README is not entirely accurate. Exclusive locking in the sync'd approach is just the easiest implementation, not the sole possibility. I can't speak for other tools, since I do not know their internals well enough, but I can say about Borg (http://www.borgbackup.org/) that there is no inherent issue in running the important parts of making backups (i.e. uploading and deduplicating data) in parallel. It's just not implemented.

Cloud storage back-ends are a somewhat similar story. It wouldn't be that complex, although locking is a problem due to the EC model of most of these services. Plans have existed for quite some time now to enable this — just no time to implement them, and other features are requested more frequently.

As a user, I don't give credit for features that could be implemented. Somehow they found time to implement this feature and Borg didn't, so they are legitimately ahead in that aspect.
I might be wrong but I want to hear more from you if you're a Borg developer. My understanding is that you may be able to have multiple clients uploading chunks at the same time, but you won't be able to exploit cross-client deduplication if different clients have a similar set of files (OS files or a large code base for instance). Moreover, if your implementation requires locks then it would be very hard to extend to cloud services.
Yes, that's right, concurrent addition of the same chunks would generally mean that some work is wasted; so concurrent long running jobs would not synchronize well in this model, and lock-free performs clearly better there.

The only operation which inherently has to be guarded by a lock in Borg is inserting the archive pointer into the manifest (root object, see https://borgbackup.readthedocs.io/en/latest/internals/data-s...). I suppose it would be possible to work around that without locking or to use the usual hacks around EC, put/get/check/get/check?put/get/check?put etc. until it's "probably there".

Deleting / pruning archives would still require a full lock due to the same conceptual issues that your two-phase GC avoids. The same goes for "check".

For about the last year or so I've been looking for an online backup system with the following requirements:

- Off-site storage, preferably not costing too much.

- Option for on-site storage (e.g., to store a backup "in the cloud" and on my NAS)

- Keeps version history, with the associated goodies (purging old backups, etc)

- Able to run on FreeBSD and Linux, with Windows and MacOS being nice to have but not required.

- Able to back up multiple machines to one account.

I strongly suspect that my solution will involve two separate things - one to actually do the backups and another for the storage.

So far, not having looked at Duplicacy, I'm leaning strongly towards attic/borg with rsync.net for off-site storage. At first glance, Duplicacy looks like it will meet my requirements so I will have to give it a closer look before I pick a solution.

You just need Borg. Here's a post I wrote about it (as you say, Borg and rsync.net):

https://www.stavros.io/posts/holy-grail-backups/

I have posted it to maybe help a few people who want to do backups: https://news.ycombinator.com/item?id=14507656

I currently use Attic for backups going onto my NAS, so one plus for attic/borg is familiarity. I figure that if I'm going to go with rsync.net, I'll switch to borg since it's (as you point out) better maintained.

Are you using rsync.net's "hidden" attic/borg option? This makes the price very attractive.

You mention using "attic check" to guard against bitrot on the provider's storage. How is this in terms of bandwidth used? Does it have to transfer every byte or does it compute a checksum on the encrypted data (since rsync.net doesn't have the raw data) and just send that?

> Are you using rsync.net's "hidden" attic/borg option? This makes the price very attractive.

I am, yes, and it is quite attractive.

> You mention using "attic check" to guard against bitrot on the provider's storage. How is this in terms of bandwidth used? Does it have to transfer every byte or does it compute a checksum on the encrypted data (since rsync.net doesn't have the raw data) and just send that?

It's very bandwidth-efficient, but I have stopped doing that every day, as rsync.net told me they use ZFS and scrub their arrays regularly, so they would discover bit rot early. I only run the check once a month now.

(attic) borg check --repository-only does not transfer any data except informational logs. This is CRC32 only [1]. Borg 1.1 beta has borg check --verify-data, which does full decryption & full MAC + ID checks -- by downloading all data.

Generally speaking, attic has at least one data corruption bug, fixed in Borg, that makes it unsafe to use with remote repositories unless the SSH connection is 100% stable. Attic also has another similar bug that corrupts the created archive when it encounters an I/O error in the repository.

There is a "Migrating from Attic" "sales pitch" (if you like) in the beta docs (-- switch versions in the lower left corner for stable docs): http://borgbackup.readthedocs.io/en/latest/faq.html#migratin...

[1] and of course checksumming and error correction of the file system, if any. Since rsync.net is ZFS the --repository-only check is stronger than on a plain old file system with no checksums.
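
To illustrate why a CRC32-only check and a MAC check catch different problems, here is a toy sketch. The record layout (4-byte CRC, then a 32-byte HMAC-SHA256, then the payload) and the function names are made up for the example and are not Borg's on-disk format.

```python
import hashlib
import hmac
import zlib


def seal(key, payload):
    """Build a record: CRC32 over (MAC + payload), then MAC, then payload."""
    mac = hmac.new(key, payload, hashlib.sha256).digest()
    body = mac + payload
    return zlib.crc32(body).to_bytes(4, "big") + body


def check_crc(record):
    """Cheap check needing no key: catches accidental corruption only."""
    return zlib.crc32(record[4:]) == int.from_bytes(record[:4], "big")


def check_mac(key, record):
    """Keyed check: also catches modifications made by someone without the key."""
    mac, payload = record[4:36], record[36:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(mac, expected)
```

The CRC can be verified by a server that never sees the key, which fits the point above about not transferring data. Anyone can recompute a CRC after altering a record, though, so only the keyed check on the client side detects that.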

I use git-annex which satisfies almost all of those requirements. I don't back up whole machines (why backup the OS when I can reinstall faster), but it otherwise has all those things.
I wasn't clear when I said "multiple machines" - I'm definitely not backing up whole machines!

It would be a case of multiple machines each backing up their /etc, /home, /var, etc to one place.

Have you looked at crashplan? Although I can't say I've tried running it on FreeBSD...
I haven't looked at Crashplan in detail yet, though one of my co-workers uses it for his Macs and is happy with it.

A quick Google search gives me the feeling that it runs under FreeBSD's Linux emulation and the port seems to break occasionally. I could run it on a real Linux in a VM on FreeBSD though, so that's a potential option.

I use CrashPlan on Linux, Windows, and Mac OS X. Works great for me. Saved my bacon a few times (e.g. accidentally deleting my wedding photos, tax information, etc.). Worth checking out.

(Disclosure: I'm a paying customer of CrashPlan, but otherwise have no connection with the service.)

How is this related to the other Duplicity backup software?

http://duplicity.nongnu.org

Not related at all. (Duplicacy != Duplicity)

Duplicity is a pretty straight good old-fashioned incremental backup program.

Duplicacy on the other hand is hash-based deduplication (BorgBackup / Attic, Restic etc. are some others).

The design of Duplicacy is slightly different from that of e.g. BorgBackup. Duplicacy, as the title says, uses a lock-free approach. BorgBackup and the handful of open source tools in the same spirit use a synchronized approach.

I had been using rclone (https://rclone.org/) for Amazon S3, which has some of the same features but recently the application key was blocked by Amazon. Is duplicacy safe from the same fate?
I think Amazon only blocked rclone's application key for Amazon Drive. There is no way for Amazon to prevent a third-party application from accessing S3, since users provide their own S3 credentials and Amazon doesn't know who is on the other side.
Can someone explain how it's able to make small updates, e.g. to s3? How does it know what's already there -- cache? How does it prune old chunks -- will there be tons of individual API requests to S3?
We use a pack-and-split approach -- files are packed first (as if building a big tar file, although this is only conceptual) and then split into chunks using a variable-sized chunking algorithm. You can customize the chunk size but by default the average chunk size is 4MB so you won't be uploading too many small files.
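
For readers unfamiliar with variable-sized chunking, here is a small sketch of the pack-and-split idea. It uses a gear-style rolling hash, which I picked for brevity and which is not necessarily the algorithm Duplicacy uses. The chunk sizes are tiny compared to the 4MB default so the example runs quickly, and pack and split are hypothetical names.

```python
import random

# Fixed table of pseudo-random 64-bit values, one per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def pack(files):
    """Concatenate file contents as if building one big archive (conceptual)."""
    return b"".join(files)


def split(stream, avg_bits=12, min_size=1024, max_size=16384):
    """Yield variable-sized chunks of the bytes object `stream`.

    A boundary is placed where the low `avg_bits` bits of the rolling hash
    are zero, so boundaries depend on content rather than on byte offsets.
    """
    mask = (1 << avg_bits) - 1
    start = 0
    h = 0
    for i, byte in enumerate(stream):
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            yield stream[start:i + 1]
            start = i + 1
            h = 0
    if start < len(stream):
        yield stream[start:]  # trailing chunk may be shorter than min_size
```

Because boundaries follow the content, inserting a few bytes near the start of the stream tends to change only the nearby chunks instead of shifting every later chunk, which is what makes deduplication across revisions effective.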
I haven't read the code (read: speculation ahead!), but at least the "what's already there" part seems rather easy to me if the backups are performed in a chunk-based, deduplicated way (cf. also borg backup[1] and restic[2]): First, you perform a GET BUCKET[3], which gives you a list of all files in the bucket. If you name your chunk files after their hashes, that's all the info you need about which chunks you still have to upload. You can then proceed to chunk your local files and upload the missing parts.

The only question remaining would be the amount of data (i.e. filenames) you'll have to download per amount of data in the backups, which you can vary by adjusting the chunking size.

[1]: https://github.com/borgbackup/borg

[2]: https://restic.github.io/

[3]: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET...

Edit: Because S3's PUT OBJECT[4] is idempotent in this case (i.e. ignoring hash collisions, as their probability should be orders of magnitude lower than a doomsday scenario), you could of course just transfer each chunk every time. Realistically, all this would do is hog your bandwidth and ruin your performance. That idempotence is why it's possible to make the whole thing lock-free; otherwise, two clients uploading the same chunk would be a problem.

[4]: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT...
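
The speculation above can be sketched as follows. A plain dict stands in for the bucket so the example stays self-contained: listing the dict keys plays the role of the bucket listing, and assignment plays the role of the PUT. SHA-256 and the names chunk_name and backup are assumptions for illustration.

```python
import hashlib


def chunk_name(data):
    """Name a chunk after its content hash (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def backup(bucket, chunks):
    """Upload only chunks whose name is not already listed; return the count."""
    existing = set(bucket)  # stand-in for a single bucket listing call
    uploaded = 0
    for data in chunks:
        name = chunk_name(data)
        if name in existing:
            continue  # already stored by this or another client
        bucket[name] = data  # stand-in for PUT; same name means same bytes
        existing.add(name)
        uploaded += 1
    return uploaded
```

If the listing is stale and another client uploads the same chunk in the meantime, the only cost is one redundant upload, since writing the same bytes under the same name changes nothing.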

Played with it for 20 minutes, really struggled with the GUI/UX. Have raised some issues (really wanted to like it), but just feels really clunky for something asking for cash.
Its encryption scheme and threat model seem to be similar to cryfs's [1].

[1] https://www.cryfs.org/

What is the encryption standard being used for file encryption?
The website makes it sound just like Tarsnap. Am I wrong? Is there some compelling feature I am missing that would make me want to switch from Tarsnap?
AFAIK Tarsnap backups go through Tarsnap's server but every other backup tool seems to not have that requirement.
Does this mean Tarsnap de-dupes before encrypting? That doesn't seem to make sense but I don't see any other reason going through their server would be required.
Tarsnap uses content based hashing too. The pipeline is basically: tar | chunk | encrypt | upload-new-chunks

The tarsnap server provides a transactional KV store-- "In order to create a new archive, the tarsnap client sends a "write transaction start" request, many "write data" requests, and a "commit transaction" request to the tarsnap server; deleting an archive is similar (except with a "delete transaction start" and "delete data" requests)." http://www.daemonology.net/blog/2008-12-14-how-tarsnap-uses-...

I think it deduplicates after encrypting, if I recall correctly.