by AndrewDavis 1174 days ago
This happened to me. Lost a bunch of data.

I had backups on an external drive that I'd periodically copy data to.

I can't remember the exact sizes but this will still explain in principle what happened.

I had a 1TB drive in my desktop. I had a 500GB external drive. At time of purchase I had less than 500GB to back up.

At some point in time my desktop hard drive started corrupting data unbeknownst to me.

The amount I needed to back up grew beyond 500GB, so I purchased a new, larger backup drive. I did a full copy (corruption and all) from my desktop to the new backup drive.

At some point I repurposed the old backup drive for something else, erasing it. It was at this point that I had irrecoverable data loss, and I still didn't know.

The corruption became so widespread on my desktop drive that I became aware of it. I checked my backup and discovered that a non-trivial amount of my data was corrupted.


I had a similar thing happen to me. The SATA controller probably failed.

At first it corrupted a few files. I thought nothing of it since I had had a few power outages. Then it corrupted more files, so I reformatted, but the file corruption kept happening. I switched the drive to a separate chipset with the same cable and all was good.

My current solution to this situation is a low-power PC running FreeBSD, with ECC RAM and a ZFS pool consisting of five mirrored drives. My main workstation pushes backups to this PC, which makes a snapshot each time. I plan to change it to a pull configuration, though. That way it will be immune to cryptolocker software performing privilege escalation attacks, since it will offer no services and the workstation will never see its credentials. I will have to configure it using its own keyboard, though.

Even then the backups need to be tested.

> Even then the backups need to be tested.

Isn't that the role of a ZFS scrub?

Or do you mean testing if say a JPG file is still a valid JPG?

I think there are scripts that can store an MD5 of each file in a SQLite database for filesystems without checksumming, such as XFS.
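For anyone curious what such a script might look like, here is a minimal sketch in Python. It is not any particular existing tool: the function names `file_md5` and `record_checksums`, the table name `checksums`, and the schema are all made up for illustration. It walks a directory, hashes each file, and reports the paths whose hash differs from the one stored on the previous run.

```python
import hashlib
import os
import sqlite3


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(root, db_path):
    """Hash every regular file under root and store the result in SQLite.

    Returns a sorted list of paths whose hash differs from the stored one.
    The table name and schema are hypothetical, chosen for this sketch.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checksums ("
        "path TEXT PRIMARY KEY, md5 TEXT NOT NULL)"
    )
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks, sockets, and similar
            new_hash = file_md5(path)
            row = conn.execute(
                "SELECT md5 FROM checksums WHERE path = ?", (path,)
            ).fetchone()
            if row is not None and row[0] != new_hash:
                changed.append(path)
            # Remember the latest hash so the next run compares against it.
            conn.execute(
                "INSERT OR REPLACE INTO checksums (path, md5) VALUES (?, ?)",
                (path, new_hash),
            )
    conn.commit()
    conn.close()
    return sorted(changed)
```

Note that a changed hash only means the bytes changed since the last run. The script cannot tell a deliberate edit from corruption, and because it overwrites the stored hash, each change is reported only once. Keep the database outside the tree being scanned so it does not hash itself.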

I meant tested before restoring. If the same problem as mentioned above were to occur, I would have backups of all my files pre-corruption, though they would be spread across multiple snapshots.

Also, from my understanding, TCP/IP error correction isn't that great: https://news.ycombinator.com/item?id=25335936

It's definitely possible to write a script that compares a file across multiple snapshots and flags it if its content changes but its modification time does not. It just gets tricky when the file is modified between backups, as the file could have been modified, then corrupted, then backed up. In that case, how does the script know that the file has been corrupted?
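A rough sketch of that comparison in Python, assuming both snapshots are browsable as ordinary directories (ZFS usually exposes them under `.zfs/snapshot/`). The names `sha256_of` and `silent_changes` are made up for this example. It only flags the easy case: same relative path, same modification time, different content.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def silent_changes(old_snapshot, new_snapshot):
    """Return relative paths whose content differs although the mtime matches.

    Files that exist in only one snapshot, or whose mtime differs, are
    skipped: a changed mtime suggests a deliberate edit, which this check
    cannot tell apart from corruption.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(old_snapshot):
        for name in filenames:
            old_path = os.path.join(dirpath, name)
            rel = os.path.relpath(old_path, old_snapshot)
            new_path = os.path.join(new_snapshot, rel)
            if not (os.path.isfile(old_path) and os.path.isfile(new_path)):
                continue
            old_mtime = os.stat(old_path).st_mtime_ns
            new_mtime = os.stat(new_path).st_mtime_ns
            if old_mtime != new_mtime:
                continue  # touched between snapshots; cannot judge
            if sha256_of(old_path) != sha256_of(new_path):
                suspects.append(rel)
    return sorted(suspects)
```

This does nothing for the hard case described above. A file that was edited and then corrupted before the next backup has a new modification time, so it is skipped like any other legitimate edit.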

So your local disk/RAM corrupts data and it gets pushed to the ECC box...

It's all well and good if you notice it soon enough, but rarely touched files can drop off retention, and then you're left with only a corrupted copy.

True. The snapshots are not rolling though and I don't have much data but you are right. It's not going to be fun picking through my snapshots for individual files if they get corrupted over time.

This seems like an unavoidable issue, though, when using a workstation without ECC RAM or a copy-on-write filesystem. I thought about moving the files off my workstation to my NAS, which stores my media files. This does tick both the CoW and ECC boxes, but it's not properly set up yet. Setting up an iSCSI target on the NAS is an option, but then it gets fiddly when trying to recover specific files from different points in time, since I can't just browse the snapshot like any other filesystem.

Getting ECC memory into a workstation should be just "okay, you want it, pay 20% more and you get it", not having to find which combination of CPU, firmware and motherboard is needed for it. It's a sad state we're in.

Tell me about it! When I looked at the price of second-hand ECC UDIMMs for my server I almost cried.