by hijinks 4788 days ago
Not trying to be an ass here or anything but something doesn't add up. I understand the memory corruption idea but I wouldn't think that would replicate to the other postgresql server. So am I right in thinking there was no slave ever here?
2 comments

It really depends on how you have configured replication and what the exact issue was. Postgres replication works either by streaming WAL records directly to the slave or by shipping completed, archived WAL files. If these files were corrupted on the master, the slave would receive the corrupted files too.

Now, the files (and, when streaming directly, the packets) have a header containing some metadata, and the actual WAL entries have a fixed format, so it's likely that the slave would have detected this corruption (unless you were really unlucky, in which case the corruption would easily replicate over to the slave).
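To illustrate the detection argument, here is a minimal Python sketch of a checksummed log record. It is a toy format, not the actual Postgres WAL layout: the names `make_record`, `verify_record` and `flip_bit`, and the 4-byte CRC-32 prefix, are assumptions made up for this example. It only shows that a bit flipped after the checksum was computed is caught on the receiving side.

```python
import struct
import zlib


def make_record(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte big-endian CRC-32 (toy format)."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def verify_record(record: bytes) -> bool:
    """Return True if the stored CRC matches the payload that follows it."""
    (stored_crc,) = struct.unpack(">I", record[:4])
    return zlib.crc32(record[4:]) == stored_crc


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


record = make_record(b"UPDATE accounts SET balance = 100")
damaged = flip_bit(record, 8 * 10)  # flip one bit inside the payload

print(verify_record(record))   # True
print(verify_record(damaged))  # False
```

A receiver that rejects `damaged` here corresponds to the slave refusing to apply the record, which is the "replication stops" case rather than the "corruption replicates" case.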

But that would just cause the slave to stop replicating. Unless you monitor whether your slaves are still healthy, still streaming from the master, and keeping replication lag reasonably low, you would not notice that replication has stopped. When you fail over, you end up with the database in the state it was in when the first corrupted packet arrived.
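A rough sketch of such a check, in Python. It assumes you already have the master's and the slave's WAL positions as text in the `high/low` hexadecimal form that current Postgres versions print; how you fetch them is left out. The function names and the 16 MB threshold are arbitrary choices for illustration, not recommendations.

```python
from typing import Optional


def lsn_to_int(lsn: str) -> int:
    """Convert a WAL position such as '0/3000060' to an absolute byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def check_replica(master_lsn: str,
                  replica_lsn: Optional[str],
                  max_lag_bytes: int = 16 * 1024 * 1024) -> str:
    """Classify a replica as OK, WARNING or CRITICAL (hypothetical thresholds)."""
    if replica_lsn is None:
        # No position reported at all: the replica is not streaming.
        return "CRITICAL: replica is not reporting a WAL position"
    lag = lsn_to_int(master_lsn) - lsn_to_int(replica_lsn)
    if lag > max_lag_bytes:
        return f"WARNING: replica is {lag} bytes behind"
    return f"OK: replica is {lag} bytes behind"


print(check_replica("0/3000060", "0/3000000"))  # OK: replica is 96 bytes behind
```

The point is only that a stalled slave shows up as a lag that keeps growing, so something has to compare the two positions regularly and alert on it.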

So either you monitor your slaves, or you use synchronous replication, which ensures that your data has reached the slave before a commit is acknowledged, but that has some serious performance costs.

BTW: I would assume this was far more likely an issue with their storage, not with RAM.

Thanks for the explanation.

I don't know exactly how the replication in postgres works, but I can think of scenarios where bit flipping in RAM gets propagated to the slave(s) (e.g., newly-generated data which currently resides only in RAM gets corrupted and then fsync'd to disk, at which point it gets replicated).
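A toy Python sketch of that scenario, under the assumption that the checksum is computed over whatever is in memory at the time the data is written out. The names `checksum_and_ship` and `replica_accepts` are made up for this example and do not correspond to anything in Postgres.

```python
import zlib
from typing import Tuple


def checksum_and_ship(page: bytes) -> Tuple[bytes, int]:
    """Checksum whatever is in memory right now and 'ship' both to the replica."""
    return page, zlib.crc32(page)


def replica_accepts(page: bytes, crc: int) -> bool:
    """The replica can only check that the data matches the checksum it received."""
    return zlib.crc32(page) == crc


intended = b"balance=100"
in_memory = bytearray(intended)
in_memory[-1] ^= 0x01  # bit flip in RAM before the checksum is computed

shipped_page, shipped_crc = checksum_and_ship(bytes(in_memory))

print(replica_accepts(shipped_page, shipped_crc))  # True
print(shipped_page == intended)                    # False
```

The checksum is valid for the corrupted bytes, so the replica has no way to tell that the data is not what the application meant to write.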

The real question is, why on earth would they use non-ECC memory on their database server?

> The real question is, why on earth would they use non-ECC memory on their database server?

Perhaps because the database server is a cloud server from someone like Amazon, Mediatemple or Linode, where you have no control over the underlying hardware.