Hacker News
by heywhatupboys 1217 days ago
Don't back up file names. Back up checksums.

I agree. And for some stuff you get cryptographic checksums for free.

Backup of Git repositories:

    ... #  git fsck --full
    error: unable to unpack contents of .git/objects/a2/cf1a9631658799733f43c3b3f0a799696a4b21
    error: a2cf1a9631658799733f43c3b3f0a799696a4b21: object corrupt or missing: .git/objects/a2/cf1a9631658799733f43c3b3f0a799696a4b21

Oops... Whether the cause is malware, a failing disk, or an undetected bit flip on a machine without ECC memory (on an otherwise okay Git repo), it's trivial to detect that the repo is corrupted.
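
To make that check part of a backup routine, here is a minimal Python sketch that runs `git fsck --full` on a repository and collects the object IDs from "object corrupt or missing" lines like the ones above. The helper names `corrupt_objects` and `check_repo` are made up for illustration, and the regex only covers that one message format, so the exit status is treated as the real verdict.

```python
import re
import subprocess

# Matches lines such as:
#   error: <object id>: object corrupt or missing: <path>
# Object IDs are 40 hex chars (SHA-1) or 64 (SHA-256 repositories).
_CORRUPT = re.compile(
    r"^error: ([0-9a-f]{40,64}): object corrupt or missing", re.MULTILINE
)


def corrupt_objects(fsck_output: str) -> list[str]:
    """Return the object IDs that fsck reported as corrupt or missing."""
    return _CORRUPT.findall(fsck_output)


def check_repo(repo_path: str) -> tuple[bool, list[str]]:
    """Run `git fsck --full` in repo_path (hypothetical helper).

    Returns (ok, corrupt_ids). `ok` is False whenever git exits with a
    non-zero status, even if no line matched the regex above.
    """
    result = subprocess.run(
        ["git", "-C", repo_path, "fsck", "--full"],
        capture_output=True,
        text=True,
        check=False,  # a failed check is an expected outcome, not an exception
    )
    output = result.stdout + result.stderr
    return result.returncode == 0, corrupt_objects(output)
```

Calling `check_repo("/path/to/backup/repo")` after each backup run and alerting when the first element is `False` would be one way to use it.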

Same for my ripped archive of audio CDs. The rippers save lots of information and the rips are bit-perfect, cross-checked against the checksums of other people's rips. And the checksums are all there.

For family pictures, I add a checksum to the pictures myself.

Backups aren't really backups until they've been verified :)

People normally say don’t store binaries in git. Is this a big issue if the files don’t change very often? From what I understood the biggest problem is they don’t diff well. With photos not changing very often, can it work?

Anyone tried using git for 500G of photos?

I would love to if it worked. I have my photo collection spread out over multiple computers, and merging the edits into the master backup is always a PITA. “Was this file removed from copy A or added to copy B?” All those problems just solve themselves with a clear DVCS git history.

Not being able to ever remove photos to free up space is of course another issue with git.

Git simply wasn't designed for that. The key issue with storing binaries in it is the one you mentioned last: the way git works, a full clone has the full history of all the files. Deleting a file in git doesn't actually delete it from git history, so a fresh full clone of 500G of photos isn't going to be 500G; it's going to be that plus every deleted photo and every earlier version of an edited one that still exists in history. A shallow clone solves that, and shallow clones supposedly work better in recent versions of git, but fundamentally you're using a hammer on screws, as it were.

If you're open to new tools, git annex is what you're looking for. The other two options are Subversion, which has some DVCS features these days, or Perforce Helix Core (paid), though I can't vouch for it as I've never used it.

We've been working on some open source tooling called "oxen" that was built for large datasets of images, video, audio, text etc. We wanted to solve the exact problem you're flagging here with git.

Feel free to check it out here: https://github.com/Oxen-AI/oxen-release#-oxen. Would love any feedback!

I guess that 500GB repository would be barely usable.

You should check out Git LFS if you want to do that, as it sounds like a good idea in the first place!

What do you mean by adding a checksum to the picture? Do you add the checksum as a filename suffix, e.g. IMG0001_<checksum>.jpg, something like that? Or do you tuck it into the EXIF data and have a tool that computes the checksum of the file minus the checksum part?

Yup, exactly, just adding a suffix. I'm not only backing up .jpg files. For example, I also back up a few screenshots (some are in .png and some are in .webp format).

So I don't care about the different picture (or short family movie) formats.

I just wrote some Clojure / babashka code to do that. I also truncate the checksum so that the filename doesn't become gigantic: it's not sensitive content, it's just to detect corruption.
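
For readers who want to try the same idea, here is a rough Python sketch (not the Clojure / babashka code mentioned above, which isn't shown in this thread). It assumes SHA-256, a truncation to 10 hex characters, and an underscore separator as in `IMG0001_<checksum>.jpg`; all three are illustrative choices, as are the function names.

```python
import hashlib
from pathlib import Path

CHECKSUM_LEN = 10  # assumed truncation length, in hex characters


def short_checksum(path: Path, length: int = CHECKSUM_LEN) -> str:
    """SHA-256 of the file's bytes, truncated to `length` hex characters."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so short movies don't have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def add_checksum_suffix(path: Path) -> Path:
    """Rename e.g. IMG0001.jpg to IMG0001_<checksum>.jpg; return the new path.

    Works for any extension (.jpg, .png, .webp, ...) because only the
    stem is changed.
    """
    new_name = f"{path.stem}_{short_checksum(path)}{path.suffix}"
    new_path = path.with_name(new_name)
    path.rename(new_path)
    return new_path
```

A truncated checksum is fine for spotting accidental corruption, which is the stated goal here; it is not meant to resist deliberate tampering.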

Then I can use another computer and generate, say, all the thumbnails of the pictures and do a quick eyeball verification. If it looks correct, later on I can just automatically have the checksums verified.

Funnily enough, I had a few old JPG pictures that were corrupt, but I ended up finding the correct versions on older backups.

The checksum helps there too: otherwise you have two files with the same name (say, on different HDDs), but only one is correct and you don't know which one without manually opening them.
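
Picking the good copy can be automated. The Python sketch below assumes filenames end in an underscore followed by a truncated SHA-256 hex prefix (e.g. `IMG0001_<checksum>.jpg`); the naming scheme, the hash choice and the function names are assumptions for illustration, not the code actually used here.

```python
import hashlib
import re
from pathlib import Path

# Assumed naming scheme: <name>_<hex checksum prefix>.<ext>
_SUFFIX = re.compile(r"_([0-9a-f]{6,64})$")


def is_intact(path: Path) -> bool:
    """True if the file's SHA-256 starts with the checksum in its name."""
    match = _SUFFIX.search(path.stem)
    if match is None:
        raise ValueError(f"no checksum suffix in {path.name}")
    # read_bytes() loads the whole file; fine for photos, but large
    # videos may call for chunked hashing instead.
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual.startswith(match.group(1))


def intact_copies(candidates: list[Path]) -> list[Path]:
    """Of several same-named copies (e.g. on different disks), keep the good ones."""
    return [p for p in candidates if is_intact(p)]
```

Given the same filename found on two disks, `intact_copies([disk_a_copy, disk_b_copy])` returns only the copies whose contents still match the name, with no need to open them by hand.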

It's not super advanced and maybe a bit overkill but it's not complicated and works fine for my use case.

P.S.: I take it another way would be to use a filesystem that uses content-based addressing or does the checksumming for me.

Yeah, ZFS is supposed to alert you somehow. I've been curious about the actual end-user experience of that workflow and how it feels. Restoring from backup when CRCs don't match is excellent; I've been hoping to get into that myself ever since I discovered that various low-priority files had bit rot.

I've been playing around with Beyond Compare snapshots. I've done whole-drive snapshots. I'm pretty close to running a diff to see how things have evolved on my drives and where all the file system activity has shifted. The snapshot files are pretty small, in the MB range, maybe 6 or 10MB, I forget.

https://www.scootersoftware.com/v4help/index.html?snapshots....