| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by mdavidn 1259 days ago
	The technology to detect and recover from disk failures does exist. RAID and ZFS, for example. I would not expect a disk failure to replicate to the backup.

4 comments

ct520 1259 days ago

Working with critical infrastructure and lack of true in depth oversight it wouldn’t surprise me DR plans were not ever executed or exercised in a meaningful manner.

link

ChrisMarshallNY 1259 days ago

This is quite common.

Comprehensive DR testing is really difficult. Many orgs settle for “on paper,” or “in theory” substitutions for real testing.

They do it right; no problem.

Doing it right, though … there’s the rub …

link

tanelpoder 1259 days ago

Yep and if you ship WAL transaction logs to standby databases/replicas, corrupt blocks or lost writes in the primary database won't be propagated to the standbys (unlike with OS filesystem or storage-level replication).

Edit: Should add "won't be silently propagated"

link

ilyt 1259 days ago

Neither checks the checksum on every read as that would be performance-prohibitive. So "bad data on drive -> db does something with corrupted data and saves corrupted transformation back to disk" is very much possible, just extremely unlikely.

But they said nothing about it being bad drive, just corrupted data file, which very well might be software bug or operator error

link

sigotirandolas 1259 days ago

This is wrong, both ZFS and btrfs verify the checksum on every read.

It's not typically a performance concern because computing checksums is fast on modern hardware. Besides, historically IO was much slower than CPU.

link

guenthert 1259 days ago

> Neither checks the checksum on every read as that would be performance-prohibitive.

It is expensive. It might be prohibitive in a very competitive environment. This is hardly the case here. Safety first!

link

ExoticPearTree 1257 days ago

RAID does not really protect you from bit rot that tends to happen from time to time. ZFS might because it checksums the blocks. But if the corruption happens in memory and then it is transferred to disk and replicated, then from a disk perspective the data was valid.

link