by Tloewald 2845 days ago
Interesting idea. But is the cloud version a complete image (perhaps out of sync)? If so then it’s a performance disaster, if not it’s very fragile.

It seems to me what we really want is a cloud file system with local cache (like Dropbox or iCloud conceptually) so that if our local device is vaporized we have a pretty much up to date logical store alive and well (and we can work on any number of machines). The word “swapping” seems to me to be based on the virtual memory model which means that if anything goes wrong you have two disconnected piles of crap.

At a file level you could theoretically have a giant file that is never wholly local, but how useful is this as a feature in real terms?


I think Borg or Tarsnap use the right approach here: a map of blocks, where updating a file updates only the changed block(s). It balances the efficiency of updates against the completeness of the copy. Sort of like a FAT filesystem, only with block-level deduplication built in.

Of course you don't get a nice mirror of your files right in the cloud, unless you run a separate server that reconstructs it and makes it available as traditional buckets.
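To make the block-map idea above concrete, here is a minimal sketch. It assumes fixed-size blocks and an in-memory dictionary standing in for the cloud store; `BlockStore`, `backup`, and `restore` are hypothetical names for illustration, and real tools such as Borg or Tarsnap use more sophisticated chunking plus encryption.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose, so the example is easy to follow


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    """Hypothetical content-addressed store: block hash -> block bytes."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}
        self.uploads = 0  # number of blocks actually written to the store

    def put(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:  # deduplication: known blocks are skipped
            self.blocks[digest] = block
            self.uploads += 1
        return digest


def backup(store: BlockStore, data: bytes) -> list[str]:
    """Store the file and return its block map (ordered list of block hashes)."""
    return [store.put(block) for block in split_blocks(data)]


def restore(store: BlockStore, block_map: list[str]) -> bytes:
    """Reassemble a file from its block map."""
    return b"".join(store.blocks[digest] for digest in block_map)
```

Backing up `b"aaaabbbbcccc"` and then `b"aaaaXXXXcccc"` writes three blocks the first time and only one the second time. The store itself holds only anonymous blocks, which is why something has to run `restore` before you see a file again.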

restic and duplicacy are newer implementations of encrypted backup with block-level deduplication.

From what I tested, restic has friendlier command line options, but duplicacy is technically superior at this point (restore works way faster).

Restic's restore isn't parallelized at all, whereas its backup is. It should be straightforward to improve the restore performance.

https://github.com/restic/restic/pull/1719

I use a Rubrik appliance that does block-level dedupe and extends to the cloud. I was able to instantiate a multi-TB database from the backup to a physical server in minutes. Extremely impressed.

I decided against a block-level system with Zero because I'm trying to make predictions about which files will be needed next locally, and that's hard on a block level, I think.

I am wondering if there is a backup solution that works that way but without requiring a manual, time-consuming invocation.

For example, using something like inotify to record changed files, with a worker in the background that syncs them immediately. Like Dropbox.
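A rough sketch of that shape, under stated assumptions: inotify is Linux-specific and has no binding in the Python standard library, so this version swaps in a plain polling scan of modification times. `scan_changes`, `start_sync_worker`, and the `upload` callback are hypothetical names, not part of any existing tool.

```python
import os
import queue
import threading


def scan_changes(root: str, seen: dict[str, int]) -> list[str]:
    """Return paths whose mtime differs from the last scan; updates `seen`."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime_ns
            except FileNotFoundError:
                continue  # file vanished between walk and stat
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
    return sorted(changed)


def start_sync_worker(upload) -> tuple[queue.Queue, threading.Thread]:
    """Start a daemon thread that calls `upload(path)` for each queued path."""
    pending: queue.Queue = queue.Queue()

    def run() -> None:
        while True:
            path = pending.get()
            try:
                if path is None:  # sentinel: stop the worker
                    return
                upload(path)
            finally:
                pending.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return pending, thread
```

A caller would loop over `scan_changes(root, seen)` every few seconds and `put` each returned path on the queue. With a real inotify binding the scan would be replaced by kernel events, but the queue and worker would look the same.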

Yes, the cloud version is a complete image (without the file names though) that should be eventually consistent.

And yes, performance is a disaster right now simply because the code is not optimized at all. But the sync to the cloud happens in the background so it should not affect your performance unless you have a "cache miss".

Isn't it a fusion of HSM https://en.wikipedia.org/wiki/Hierarchical_storage_managemen... and continuous backup?

What about data that changes often locally and is part of a coherent set, the classic case being a file used by a database engine to store data? We nearly always need to mirror or back up a consistent version of it (just after a successful nested transaction; in the SQL world, the top-level "COMMIT"), but AFAIK for the time being HSM+backup software cannot detect such a state. It could trap existing system calls (fsync and co., in order to copy data to remote storage in a sync'ed state), but this is not robust because their semantics are not "upon return of this call the whole dataset (in all files) is consistent".

Moreover, if the application using the DB engine is not perfect, such an inconsistency may reside at the application level: after a COMMIT the file is consistent for the DB engine, but not for the application.

I wonder if some users of such HSM+backup software felt some major disappointment after restoring an inconsistent version of such a file. Even a minor loss (garbled index) may be hard to detect and lead to a "fork" of the data.

A dedicated system function called to signal "in my set of opened files the data are consistent" would be useful but is AFAIK missing, and even if someone adds it to some libc/kernel it will only be useful once application code actually calls it.

The kludge is a procedure: "order the engine to sync the data; throttle the engine into 'no write' mode; create a RO snapshot; back up the snapshot; unthrottle the engine; delete the snapshot", which seems not exactly "transparent".
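Spelled out as code, the procedure might look like the sketch below. Everything here is an assumption for illustration: the `engine` object with `flush`, `pause_writes`, and `resume_writes` methods and the three snapshot callbacks are hypothetical hooks, not the API of any real database or filesystem.

```python
from typing import Any, Callable, Protocol


class Engine(Protocol):
    """Hypothetical database engine hooks assumed by this sketch."""

    def flush(self) -> None: ...
    def pause_writes(self) -> None: ...
    def resume_writes(self) -> None: ...


def consistent_backup(
    engine: Engine,
    create_snapshot: Callable[[], Any],
    backup: Callable[[Any], None],
    delete_snapshot: Callable[[Any], None],
) -> None:
    """Run the steps in the order listed above, cleaning up on failure."""
    engine.flush()          # order the engine to sync its data
    engine.pause_writes()   # throttle the engine into 'no write' mode
    snapshot = None
    try:
        snapshot = create_snapshot()  # read-only snapshot of a consistent state
        backup(snapshot)              # copy the snapshot to remote storage
    finally:
        engine.resume_writes()        # unthrottle even if a step failed
        if snapshot is not None:
            delete_snapshot(snapshot)
```

The `finally` block is the part that is easy to forget: if the backup fails halfway, the engine must still be unthrottled and the snapshot removed. None of this can be done by a generic file-level tool, since every hook is specific to the engine.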

In such a case, you’re better off with a database engine that streams its journal or transaction log to an object store.

Don’t perform data operations at the wrong layer.

Indeed, and this is my point: such tools cannot be generic ("works with any file") and also transparent ("plug & play").

Yes, but those are the preconditions to user adoption.

Author here. Thanks for the Wikipedia link. I think that the software is trying to implement HSM but I didn't know that this is what it's called.

With Zero, all local data is eventually synced to the cloud but usually this only happens after the local file is idle for a while.
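As an illustration of the "sync once idle" policy described above, here is a minimal sketch. It is not taken from Zero's code: the function name `idle_files`, the `last_write` bookkeeping, and the threshold value are all assumptions.

```python
def idle_files(
    last_write: dict[str, float], now: float, idle_seconds: float
) -> list[str]:
    """Return the paths whose last write is at least `idle_seconds` old."""
    return sorted(
        path
        for path, written_at in last_write.items()
        if now - written_at >= idle_seconds
    )


# Hypothetical bookkeeping: path -> time of the last local write, in seconds.
last_write = {"notes.txt": 100.0, "db.sqlite": 195.0}
ready = idle_files(last_write, now=200.0, idle_seconds=60.0)
```

Here only `notes.txt` qualifies, because `db.sqlite` was written five seconds ago. A file that is written constantly never becomes idle under this policy, which is the database case raised earlier in the thread.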