RAID arrays fail all the time; the system has famously been one server, and the only visible recent scaling work has been front end caching.
Edit: the code has been public for a long time, and there is no database to replicate. The site ran as a single server for years, and it is unlikely the front end caching has changed anything about the "database" components.
Since RAID failures actually are somewhat common, they are probably looking at a higher level replicated storage system now, a la DRBD, or some kind of distributed file system, a la Gluster.
Doesn't RAID usually at least give some warning if you watch the syslogs? (Genuine question, I am not a sysadmin. We have Linux servers with Hetzner on software RAID 1, and a couple have had single-disk issues which we spotted straight away in Zenoss and had Hetzner replace the disk. Am I incorrect in thinking this is normal?)
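For what it's worth, here is a minimal sketch of the kind of check a monitoring tool might run against Linux software RAID. It assumes the usual `/proc/mdstat` layout, where each array's status line ends in something like `[UU]` and a missing member shows up as `_`. The function name `degraded_arrays` and the sample text are made up for illustration, and this is not how Zenoss actually does it.

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member status shows a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status line ends with one U (up) or _ (missing) per member disk
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if status and current:
            if "_" in status.group(1):
                degraded.append(current)
            current = None
    return degraded


# Hypothetical sample in the /proc/mdstat format; md1 has lost one disk.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      488279488 blocks [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a real box you would feed it the contents of `/proc/mdstat` instead of `SAMPLE`. Note this only catches a disk dropping out of the array, which is the hardware-failure case RAID is meant for.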
RAID is a method for surviving hardware failure. If you have a software failure in, say, the VFS layer, RAID will happily accept the order to write garbage all over your inode trees and will carefully store and make sure that all the appropriate disks can return the same garbage every time. And yes, it should warn you when you need to replace a disk which is no longer returning the right garbage.
Similarly, if you rm -rf a vital directory tree, RAID can ensure that it goes away reliably.
Yes, you're right. So replies will now switch to how RAID doesn't stop you from deleting data, because... well, I have no idea why. It seems to just be a law of nature.
DRBD and Gluster are not any more resilient to filesystem corruption than a RAID device is. In this kind of case you hope for either real-time replicated storage on a completely separate physical host or very recent backups.
Back in ~2004 I watched IT spend a whole day recovering our 60-person startup's main Linux NFS server, due to a software bug in the storage driver. Had to rebuild the whole system from backups.
Yes, I have in fact, in a DRBD configuration. The bug was esoteric, but it happened and was not the result of user error. DRBD and Gluster both allow faults in the VFS layer to propagate to all replicas.
I think Gluster should, by design, avoid replicating filesystem metadata corruption (though it would replicate internal metadata issues in files on top of the filesystem), but DRBD won't. At high volumes I still regularly break Gluster, but it'd probably be OK for lower bandwidth/ops use. Not sure what the HN disk usage pattern is though.
I guess he meant having two separate logs: one for production, and a secondary one holding his journal. In that case you can restore the original data from backup and then replay the rest from the external log. That's the solution I use for really important data where I cannot afford any data loss, even if downtime is acceptable. Each commit is written to two separate systems, but the secondary system holds only a journal that can be replayed.
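A rough sketch of that restore-then-replay idea, under some loud assumptions: the "primary" here is just an in-memory dict, the journal is a local JSON-lines file standing in for the separate system, and the names `Store`, `backup`, `restore` and `demo` are hypothetical. Each entry carries a sequence number so the replay knows where the backup left off.

```python
import json
import os
import tempfile


class Store:
    """Primary key/value data plus a separate append-only journal file."""

    def __init__(self, journal_path, data=None, seq=0):
        self.journal_path = journal_path
        self.data = dict(data or {})
        self.seq = seq  # sequence number of the last applied commit

    def commit(self, key, value):
        # Write the journal entry and force it to disk before touching
        # the primary data, so the journal never lags behind it.
        self.seq += 1
        entry = {"seq": self.seq, "key": key, "value": value}
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps(entry) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        self.data[key] = value

    def backup(self):
        # A backup is a copy of the data plus the sequence it is valid up to.
        return {"seq": self.seq, "data": dict(self.data)}


def restore(backup, journal_path):
    """Start from the backup, then replay journal entries newer than it."""
    store = Store(journal_path, backup["data"], backup["seq"])
    with open(journal_path, encoding="utf-8") as journal:
        for line in journal:
            entry = json.loads(line)
            if entry["seq"] > store.seq:
                store.data[entry["key"]] = entry["value"]
                store.seq = entry["seq"]
    return store


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        journal_path = os.path.join(tmp, "journal.log")
        store = Store(journal_path)
        store.commit("a", 1)
        snapshot = store.backup()  # last backup taken here
        store.commit("b", 2)       # these two commits exist only in the journal
        store.commit("a", 3)
        # Pretend the primary was lost: rebuild from backup plus journal.
        return restore(snapshot, journal_path).data


if __name__ == "__main__":
    print(demo())
```

In a real setup the journal would live on a different host and storage stack than the primary, otherwise the same bug can eat both. Replay is also only as good as the journal: a bad write that was faithfully journaled gets faithfully replayed.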