Hacker News new | ask | show | jobs
by beagle3 3352 days ago
Damaged files (that can still be read from disk without error) are more likely than not a result of bad memory. It is incredibly unlikely to have bits mutated on disk without triggering a CRC error -- approx 1 in 4 billion errors would go undetected in a CRC-32; and it has to be spread over more than 32-bits-wide. The fact that you had multiple damaged files basically guarantees that it was a faulty memory (or bus, or controller -- but not actual magnetic media corruption) issue.

ZFS can't really help you with that as the data was likely damaged in transit rather than on disk; though with its own 256 bit hash, it is likely to detect those faulty system components earlier than later.

4 comments

All my memory is ECC (other than my barely-used laptop). Been there, bought the t-shirt, decided non-parity can go jump in a lake more than a decade ago.

http://i.imgur.com/uz2inSy.jpg

I encountered this going through a copy of my photos stored on a pair of WD Greens using NTFS 3-4 years ago. The original copy on a ZFS machine was fine. I found a few others, and promptly stopped using those drives.

Two years ago I had repeated bursts of ZFS checksum errors from a pair of SanDisk SSDs. Evidently TRIM didn't quite work perfectly 100% of the time, and caused data corruption - luckily ZFS was always able to repair it, and it being detected meant I could do something about it early - I updated firmware and the issue went away. Last year it came back after an OS update, and I just turned TRIM off completely (I guess it was sensitive to TRIM patterns and those changed).

Last year I also had a Toshiba HDD forget how to IO properly, and got a constant stream of ZFS checksum errors from it until I yanked it from the hot-swap bay and reinserted it. It resilvered and scrubbed fine.

These aren't the only times I've seen checksum errors and silent corruption, they're just the most recent. ZFS lost a file once, and was very noisy about it - the status message for the lost metadata stayed until I recreated the pool. NTFS, UFS2, ext2, all were completely silent on the fact that they were showing me data that was clearly wrong.

I don't trust disks, or IO controllers, and I don't trust filesystems that do. Neither should you.

I also disable TRIM on all my devices. Rather I set aside 7% of disk space that I never touch. It eliminates the need for TRIM without performance degradation.
To mitigate the likeliness of damage during transit, it's recommened to use ECC memory with ZFS. I had FreeNAS running on a system with a faulty memory module. ZFS correctly discovered damage in random files I just transferred over. Luckily, it wasn't too late to replace the bad module and repeat the transfer.

Lessons learned:

- Run MemTest86 on hardware before using it as storage

- Use an FS that does checksums

Unfortunately, the latter doesn't seem to apply to APFS.

Can you always just buy ECC memory (I have DDR3, I guess it's more affordable now since it's a little bit older), or does something else (cpu, motherboard) need to support it as well?
You need a compatible motherboard and CPU.
Which is why Intel kinda sucks. It would cost them basically nothing to have ECC enabled on all of their hardware, but they insist on using it as a differentiator between server and desktop parts.

AMD (at least in the past), includes it on all parts so that it's up to the consumer to choose.

While I agree that it'd be nice if Intel didn't use ECC as a server/desktop delimiter, AMD's stance (at least for AM3 and AM4 parts) appears to have been "the CPUs support it, but we haven't run any validation tests on it; if the motherboard manufacturers want to validate it and turn it on, great".[1][2] Which is not quite the same as it being enabled on all parts.

[1] - http://www.hardwarecanucks.com/forum/hardware-canucks-review...

[2] - https://www.reddit.com/r/Amd/comments/5x4hxu/we_are_amd_crea...

That absolutely does mean it's enabled on all parts. They also don't validate the chips against FreeBSD - does that mean you can't run FreeBSD on their chips? Or do you think it would be ridiculous to expect them to test scenario's outside of the market they're targeting?

Do you have any idea the cost of running validation tests? I'm not the least bit concerned that they haven't "validated" the ECC functionality. It's enabled, they know it works, it's the same ECC they use on server class chips, and if someone found a bug I have no doubt they'd issue microcode to fix it.

Precisely because it is a market segmentation tactic, they do allow ECC on lower end chips which they do not see stealing marketshare from Xeon.

My home media center PC / NAS runs an i3-4370 CPU with ECC RAM

The same is true for AMD's new Ryzen CPUs, ECC support is enabled. You still need a suitable motherboard though.
On linux you can force enable ECC on an unsupported motherboard with

> ecc_enable_override=1

Most motherboards have the extra memory traces in place even if they don't enable ECC.

"cost them basically nothing"

"using it as a differentiator between server and desktop"

so it would cost them SOMETHING...

Something related to artificial milking, as opposed to tangible production costs.
Eh, I've seen an Intel SSD 320 eat a bunch of pages and an HGST 5K4000 4TB direct thousands of writes to the wrong sectors. There are a lot of things that can go wrong with storage devices, not just the optimistic case of a bit getting flipped after being written correctly.
Aside from transferring from disk to every larger disk, how would memory corrupt an mp3? Given Apple's long-time tradition of putting metadata (playcounts/star ratings/etc) in a separate file, an mp3 on a MacOS computer is pretty much write once, read many.
Static data on a flash device (e.g. SSD) is subject to wear leveling, which is a regular re-writing process. This is counter intuitive, but makes sense.

If your flash device never moved the static data, the only flash blocks that would get wear cycles would be the flash blocks that contained dynamic (normally changed and rewritten) data. The result would be the blocks of flash that were not static would quickly wear out and the blocks of flash that were static would have a lot of unused write cycles available.

In order to use all of the wear cycles of all of the blocks, the static data has to be moved regularly so the blocks all have a (roughly) equal number of wear cycles[1]. Every time the data is moved, there is an opportunity for data corruption.

The flash data blocks (typically) have ECC (error checking and correction) which is designed to prevent data corruption. There are limitations to ECC:

* ECC can only correct a limited number of errors.

* Flash memory is not a perfect storage medium, it can "bit rot" too - the primary reason for ECC with flash is to "hide" the inherent bit rotting of flash. "MLC"[2] flash chips aggravate the problem because their margins are smaller.

* If a memory controller does a wear leveling move and the source data is bad, beyond the ability of the ECC to correct, it has no way to correct that error and (generally) has no way to inform the user that their file (system) has suffered corruption.

In Jean-Louis Gassée's anecdote (which is typical), his notification that his wife's files were corrupt was an backup failure notification. The backup failure was telling him that it could not read files, but it was not clear to him (and would not be clear to most users) that the root cause was file corruption, not a backup problem per se.

[1] https://en.wikipedia.org/wiki/Wear_leveling#Static_wear_leve...

[2] https://en.wikipedia.org/wiki/Multi-level_cell