> "if the RAM is the weakest link it doesn't matter what underlying FS is used"
When is this ever the case, though? Even non-ECC RAM is more reliable than hard drives except when a RAM chip has died, which most ECC systems don't handle well either. Full chipkill capability is pretty rare.
As I understand the reasoning, it's that ZFS checksumming creates a risk of false positives where non-checksumming filesystems wouldn't have them. So without ECC RAM you're trading the risk of losing data to your IO chain for the risk of losing data to increased RAM activity. With ECC RAM it's an unambiguous win.
It sounds to me like there must be some situation where ZFS reads data from disk, verifies the checksum, keeps it in memory where it could get corrupted, then recomputes the checksum on the corrupted version and writes either the corrupted data or the incorrect checksum to disk but not both, so that the data on disk is now guaranteed to not match the checksum on disk.
Without something like the above sequence of events, I don't see how ECC eliminates any class of errors, rather than just reducing their probability. So what actually triggers such a chain of events?
When is this ever the case, though? Even non-ECC RAM is more reliable than hard drives except when a RAM chip has died, which most ECC systems don't handle well either. Full chipkill capability is pretty rare.