| HN Mirror


by noirscape 60 days ago

I can understand in theory why they wouldn't want to back up .git folders as-is. Git has a serious object count bloat problem if you have any repository with a good amount of commit history, which causes a lot of unnecessary overhead in just scanning the folder for files alone.

I don't quite understand why it's still like this; it's probably the biggest reason why git tends to play poorly with a lot of filesystem tools (not just backups). If it'd been something like an SQLite database instead (just an example really), you wouldn't get so much unnecessary inode bloat.

At the same time Backblaze is a backup solution. The need to back up everything is sort of baked in there. They promise to be the third backup solution in a three layer strategy (backup directly connected, backup in home, backup external), and that third one is probably the single most important one of them all since it's the one you're going to be touching the least in an ideal scenario. They really can't be excluding any files whatsoever.

The cloud service exclusion is bad in the same way, only much worse. Imagine getting hit by a cryptoworm. Your cloud storage tool is dutifully going to sync everything encrypted, junking up your entire storage across devices, and because restoring old versions is both ass and near impossible at scale, you need an actual backup solution for that situation. Backblaze excluding files in those folders feels like a complete misunderstanding of what their purpose should be.

9 comments

stebalien 60 days ago

I've actually spent some time debugging why git causes so many issues with the backup software I use (restic).

Ironically, I believe you have it backwards: pack files, git's solution to the "too many tiny files" problem, are the issue here; not the tiny files themselves.

In my experience, incremental backup software works best with many small files that never change. Scanning is usually just a matter of checking modification times and moving on. This isn't fast, but it's fast enough for backups and can be optimized by monitoring for file changes in a long-running daemon.
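A minimal sketch of the kind of scan described above, assuming the backup tool keeps a saved index of modification time and size per path. The name `scan_changed` and the in-memory `index` dict are made up for illustration; this is not how restic or any specific tool stores its state.

```python
import os

def scan_changed(root, index):
    """Return paths under root whose (mtime_ns, size) differ from the saved index.

    `index` maps path -> (mtime_ns, size) from the previous run and is
    updated in place, so a second call with no changes returns [].
    """
    changed = []
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)
                    fingerprint = (st.st_mtime_ns, st.st_size)
                    # Only files with a new or different fingerprint need reading.
                    if index.get(entry.path) != fingerprint:
                        index[entry.path] = fingerprint
                        changed.append(entry.path)
    return sorted(changed)
```

Files that never change cost one stat call each and nothing more, which is why many small immutable files are cheap here. The sketch does not detect deletions; a real tool would also have to compare the index against what it no longer sees.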

However, lots of mostly identical files ARE an issue for filesystems as they tend to waste a lot of space. Git solves this issue by packing these small objects into larger pack files, then compressing them.

Unfortunately, it's those pack files that cause issues for backup software: any time git "garbage collects" and creates new pack files, it ends up deleting and creating a bunch of large files filled with what looks like random data (due to compression). Constantly creating/deleting large files filled with random data wreaks havoc on incremental/deduplicating backup systems.
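A toy model of that effect, under heavy simplifying assumptions: the backup deduplicates by whole-file hash, and a "pack" is just every object concatenated and zlib-compressed. Real git packs use delta compression and real backup tools chunk files rather than hashing them whole, so the numbers are only illustrative. All names (`new_bytes`, `as_pack`, `make_object`) are hypothetical.

```python
import hashlib
import zlib

def new_bytes(previous, current):
    """Bytes a whole-file deduplicating backup would upload for `current`.

    Both arguments map file name -> content. A file is skipped if identical
    content was already present in the previous snapshot.
    """
    known = {hashlib.sha256(data).digest() for data in previous.values()}
    return sum(len(data) for data in current.values()
               if hashlib.sha256(data).digest() not in known)

def as_pack(objects):
    """Toy stand-in for a pack file: all objects concatenated, then compressed."""
    blob = b"".join(objects[name] for name in sorted(objects))
    return {"pack": zlib.compress(blob)}

def make_object(i):
    """Deterministic 512-byte fake object (made-up test data)."""
    return hashlib.sha256(str(i).encode()).hexdigest().encode() * 8

before = {f"obj{i:03d}": make_object(i) for i in range(200)}
after = dict(before, obj200=make_object(200))  # one new object is added

loose_cost = new_bytes(before, after)                     # only the new object
packed_cost = new_bytes(as_pack(before), as_pack(after))  # the whole new pack
```

With loose files, only the one new object is new to the backup. After repacking, the single pack file has different content, so in this model the entire pack counts as new data.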

adithyassekhar 60 days ago

I don’t think this is the right way to see this.

Why should a file backup solution adapt to work with git? Or any application? It should not try to understand what a git object is.

I’m paying to copy files from a folder to their servers; just do that, no matter what the file is. Stay at the filesystem level, not the application level.

noirscape 60 days ago

I'm not saying Backblaze should adapt to git; the issue isn't application related (besides git being badly configured by default; there's a solution with git gc, it's just that git gc basically never runs).

It's that to back up a folder on a filesystem, you need to traverse that folder and check every file in that folder to see if it's changed. Most filesystem tools usually assume a fairly low file count for these operations.

Git, rather unusually, tends to produce a lot of files in regular use; before packing, every commit/object/branch is simply stored as a file on the filesystem (branches only as pointers). Packing fixes that by compressing commit and object files together, but it's not done by default (only after an initial clone or when the garbage collector runs). Iterating over a .git folder can take a lot of time in a place that's typically not very well optimized (since most "normal" people don't have thousands of tiny files in their folders that contain sprawled out application state.)
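To see how many loose files a given repository actually has, here is a rough sketch that counts them directly. It assumes the standard layout where loose objects sit in two-hex-character fan-out directories under `.git/objects` and packs sit in `.git/objects/pack`; the function name `count_git_objects` is made up for the example.

```python
import os
import re

# Loose objects live in fan-out directories named 00..ff.
_FANOUT = re.compile(r"^[0-9a-f]{2}$")

def count_git_objects(git_dir):
    """Return (loose_objects, pack_files) for a .git directory."""
    objects_dir = os.path.join(git_dir, "objects")
    loose = 0
    for name in os.listdir(objects_dir):
        path = os.path.join(objects_dir, name)
        if _FANOUT.match(name) and os.path.isdir(path):
            loose += len(os.listdir(path))
    pack_dir = os.path.join(objects_dir, "pack")
    packs = 0
    if os.path.isdir(pack_dir):
        # Each pack has a .pack file plus index files; count only the .pack.
        packs = sum(1 for n in os.listdir(pack_dir) if n.endswith(".pack"))
    return loose, packs
```

`git count-objects -v` reports similar figures without any custom code.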

The correct solution here is either for git to change, or for Backblaze to implement better iteration logic (which will probably require special handling for git..., so it'd be more "correct" to fix up git, since Backblaze's tools aren't the only ones with this problem.)

masfuerte 60 days ago

7za (the compression app) does blazingly fast iteration over any kind of folder. This doesn't require special code for git. Backblaze's backup app could do the same but rather than fix their code they excluded .git folders.

When I backup my computer the .git folders are among the most important things on there. Most of my personal projects aren't pushed to github or anywhere else.

Fortunately I don't use Backblaze. I guess the moral is don't use a backup solution where the vendor has an incentive to exclude things.

toast0 60 days ago

IMHO, you can't do blazingly fast iteration over folders with small files in Windows, because every open is hooked by the anti-virus, and there goes your performance.

noirscape 60 days ago

Not just antivirus, there's also file locking.

Windows has a much harsher approach to file locking than Linux and backup software like Backblaze absolutely should be making use of it (lest they back up files that are being modified while they back them up), but that also means that the software effectively has to ask the OS each time to lock the file, then release the lock when the software is done with it. With a large number of files, that does stack up.

Linux file locking is, to put it mildly, deficient. Most software doesn't even bother acquiring locks in the first place. Piling further onto that, basically nobody actually uses POSIX locks because the API has some very heavy footguns (most notably, every lock a process holds on a file is released whenever that process calls close() on any descriptor for that file, even if another component of the same process still relies on the lock through a different descriptor). Most Linux file locks instead work on the honor system; you create a file called filename.lock in the same directory as the file you're working on, and then any software that detects the filename.lock file exists should stop reading the file.
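A minimal sketch of the filename.lock convention described above. The helper names `try_lock` and `unlock` are made up; the only real mechanism used is that opening with `O_CREAT | O_EXCL` fails if the file already exists, so only one caller can create the lock file.

```python
import os

def try_lock(path):
    """Try to create `path + ".lock"`; return True if we now hold the lock."""
    try:
        # O_CREAT | O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True

def unlock(path):
    """Release the lock by deleting the lock file."""
    os.remove(path + ".lock")
```

This only works if every program touching the file checks for the lock file, and a crashed process leaves a stale lock behind, which is part of why it counts as an honor system.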

Nobody using file locks is probably the bigger reason why Linux chokes less on fast iteration than Windows, given that Windows is slow with loads of files even when you aren't running a virus scanner.

buzer 59 days ago

I have never personally used it, but aren't Windows' Shadow Copies supposed to be the answer to file locking/modification issues?

jcgl 59 days ago

> Linux file locking is, to put it mildly, deficient.

Since the introduction of flock on Linux, how bad is it really though? I don't see why one would need kludges like filename.lock. Though of course flock is still an "honor system" as you put it.
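For comparison, a small sketch of flock via Python's `fcntl` module, assuming Linux and a local filesystem. The wrapper name `try_flock` is made up. The lock is advisory: it only affects other code that also calls flock on the same file.

```python
import fcntl

def try_flock(file_obj):
    """Try to take an exclusive, non-blocking flock; return True on success."""
    try:
        fcntl.flock(file_obj, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else (another open file description) already holds the lock.
        return False
    return True
```

Unlike POSIX fcntl locks, a flock lock belongs to the open file description, so closing an unrelated descriptor for the same file does not drop it.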

Tor3 60 days ago

Same - on one of my computers (Linux, btw) the only directories in the list of directories to back up are .git directories. That's what I'm concerned with, so that's what I back up. And it works just fine, with my provider.

NetMageSCW 60 days ago

Actually once the initial backup is done there is no reason to scan for changes. They can just use a Windows service that tells them when any file is modified or created and add that file to their backup list.

DarkUranium 60 days ago

To an extent. WinAPI's file watching has a race condition in it, and there's no simple workaround (just complex & error-prone ones).

Well, for backups the workaround is a bit easier (as they strictly only ever read files), but still.

Saris 60 days ago

Backblaze offers 'unlimited' backup space, so they have to do this kind of thing as a result of that poor marketing choice.

conductr 60 days ago

No they don’t. They just have to price the product to reflect changing user patterns. When Backblaze started, it was simply “we back up all the files on your drive”; they didn’t even have a restore feature, that was your job when you needed it. Over time they realized some user behavior changed: these cloud drives were a huge data source they hadn’t priced in, git gave them some problems that they didn’t factor in, etc. The issue is that their solution to dealing with it is to exclude it, and that means they’re now a half-baked solution for many of their users. They should have just changed the pricing and supported the backup solution people need today.

adithyassekhar 60 days ago

If they must scam, shouldn’t they be deduplicating on the server rather than the client?

Ajedi32 60 days ago

FWIW some other people in this thread are saying the article is wrong about .git folders not being backed up: https://news.ycombinator.com/item?id=47765788

That's a really important fact that's getting buried so I'd like to highlight it here.

Ajedi32 60 days ago

Well, I checked and it looks like none of my .git repos are backed up. All attempts to restore only restore the working copy. -_- I'm not sure why it was working for the person in the comment I linked.

Ajedi32 56 days ago

Update: Deleting C:\Programdata\Backblaze\bzdata\bzexcluderules_mandatory.xml resolved the problem for me. Seems like at one point[1] they started excluding .git directories by default, got a bunch of backlash, reverted that change, but never changed the setting back for some users (like me).

[1]: https://www.reddit.com/r/backblaze/comments/1cgy93n/i_did_a_...

rmccue 60 days ago

I think it's understandable for both Backblaze and most users, but surely the solution is to add `.git` to their default exclusion list which the user can manage.

maalhamdan 60 days ago

I think they shouldn't back up git objects individually because git handles the versioning information. Just compress the .git folder itself and back it up as a single unit.
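A minimal sketch of that idea using Python's `tarfile`, assuming nothing is writing to the repository while the archive is built. The function name `archive_git_dir` and the paths are made up for illustration.

```python
import os
import tarfile

def archive_git_dir(repo_path, archive_path):
    """Write repo_path/.git into a gzip-compressed tar at archive_path."""
    git_dir = os.path.join(repo_path, ".git")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative (".git/...").
        tar.add(git_dir, arcname=".git")
    return archive_path
```

The backup tool then sees one file instead of thousands, at the cost of re-uploading the whole archive whenever anything in the repository changes.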

willis936 60 days ago

Better yet, include deduplication, incremental versioning, verification, and encryption. Wait, that's borg / restic.

This is a joke, but honestly nobody here should be directly backing up their filesystems; use the right tool for the job instead. You'll make the world a more efficient place, have more robust and quicker to recover backups, and save some money along the way.

pkaeding 60 days ago

This is a good point, but you might expect them to back up untracked and modified files in the backup, along with everything else on your filesystem.

pixl97 60 days ago

Eh, you really shouldn't do that for any kind of file that acts like an (impromptu) database. This is how you get corruption, especially when change information can be split across more than one file.

pkaeding 60 days ago

Sorry, what are you saying shouldn't be done? Backing up untracked/modified files in a git repo? Or compressing the .git folder and backing it up as a unit?

pixl97 60 days ago

> Backing up untracked/modified files in a git repo?

This. It's best to do this in an atomic operation, such as a VSS-style snapshot, which is consistent because it is taken with operations on the files stopped or paused. Something like a zip is generally better because creating it touches the file system for less time than the upload typically takes.

pkaeding 58 days ago

I see what you mean, but isn't this an issue with any filesystem backup tool? Or is there something about untracked files in a git workspace that is different, that I'm not seeing?

rcxdude 60 days ago

It's probably primarily because Linus is a kernel and filesystem nerd, not a database nerd, so he preferred to just use the filesystem, whose performance characteristics he understood well (at least on Linux).

ciupicri 60 days ago

> If it'd been something like an SQLite database instead (just an example really)

See Fossil (https://fossil-scm.org/)

P.S. There's also (https://www.sourcegear.com/vault/)

> SourceGear Vault Pro is a version control and bug tracking solution for professional development teams. Vault Standard is for those who only want version control. Vault is based on a client / server architecture using technologies such as Microsoft SQL Server and IIS Web Services for increased performance, scalability, and security.

grumbelbart2 60 days ago

Git packs objects into pack-files on a regular basis. If it doesn't, check your configuration, or do it manually with 'git repack'.

noirscape 60 days ago

I decided to look into this (git gc should also be doing this), and I think I figured out why it's such a consistent issue with git in particular. Running git gc does properly pack objects together and reduce inode count to something much more manageable.

It's the same reason why the postgres autovacuum daemon tends to be borderline useless unless you retune it[0]: the defaults are barmy. git gc only runs automatically once there are more than roughly 6700 loose unpacked objects[1]. Most typical filesystem tools tend to start balking at traversing ~1000 files in a structure (depends a bit on the filesystem/OS as well; Windows tends to get slower a good bit earlier than Linux).

To fix it, running

> git config --global gc.auto 1000

should retune it, and any subsequent commit to your repos will trigger garbage collection properly when there are around 1000 loose files. Pack file management seems to be properly tuned by default; at more than 50 packs, gc will repack into a larger pack.

[0]: For anyone curious, the default postgres autovacuum setting runs only when roughly 20% of the table consists of dead tuples (autovacuum_vacuum_scale_factor defaults to 0.2; dead tuples are roughly: deleted rows + every old revision of an updated row). If you're working with a beefy table, you're never hitting 20%. Either tune it down or create an external cronjob to run vacuum analyze more frequently on the tables you need to keep speedy. I'm pretty sure the defaults are tuned solely to ensure that Postgres' internal tables are fast, since those seem to only have active rows to a point where it'd warrant autovacuum.

[1]: https://git-scm.com/docs/git-gc
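The two thresholds mentioned in this comment can be written down as a rough model. This is a simplification: the function name `auto_gc_due` is made up, the defaults are the 6700 and 50 cited above (`gc.auto` and `gc.autoPackLimit`), and git itself estimates the loose object count rather than counting exactly, so treat the result as approximate.

```python
def auto_gc_due(loose_objects, pack_files, gc_auto=6700, gc_auto_pack_limit=50):
    """Rough model of whether `git gc --auto` would find work to do."""
    if gc_auto <= 0:
        # gc.auto = 0 disables automatic garbage collection entirely.
        return False
    if loose_objects > gc_auto:
        return True
    # A pack limit of 0 disables the pack-count check.
    return gc_auto_pack_limit > 0 and pack_files > gc_auto_pack_limit
```

With the defaults, a repository holding 3000 loose objects and one pack is left alone; with `gc.auto` set to 1000 the same repository would be repacked.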

LetTheSmokeOut 60 days ago

I needed to use

> git config --global gc.auto 1000

with the long option name, and no `=`.

Dylan16807 60 days ago

A few thousand files shouldn't be a problem to a program designed to scan entire drives of files. Even in a single folder and considering sloppy programs I wouldn't worry just yet, and git's not putting them in a single folder.

bombcar 60 days ago

I love nothing more than running strange git commands found in HN comments.

Let's ride the lightning and see if it does anything.

yangm97 60 days ago

You don’t see ZFS/BTRFS block based snapshot replication choking on git or any sort of dataset. Use the right job for the tool or something.