by adobeeee 2783 days ago
I have a layman's question, if somebody could please answer. I have never in my entire life seen a database fail, but database failures and issues seem to be brought up all the time. I understand that part of this may be the cost associated with them. But I'm sure there's also something that I have no clue about. So my questions are:

1) what kind of problems do databases actually face?

2) what kind of scenarios create those problems?

3) how does a programmer go about testing them?

3 comments

The easiest scenario to imagine is a hardware failure or power outage. The database was in the middle of doing something and was interrupted by a hard drive dying or the lights going out. One way to test this is to literally unplug the computer and see how it handles the failure.

So, let's say you have a client/server application... the client is telling the server (database) to write some records to the database. In the middle of the write, you pull the plug. Some questions you'd want answered: what does the database look like when it restarts? Can we read it? What is the current state? Did any of the new data get written? What does the client think was written? If there was an uncommitted database transaction, was the database left unaltered?
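
If you don't want to pull a real plug, a rough software approximation is to kill the writer process in the middle of a transaction and then reopen the database. Here is a minimal sketch using Python's built-in sqlite3 module. The table name `records` and the helper `crash_test` are made up for illustration, and killing a process is NOT the same as losing power (the OS cache survives a process crash), so treat this as a first step only.

```python
import os
import sqlite3
import subprocess
import sys
import textwrap

# Child process: opens a transaction, writes rows, then dies without committing.
CHILD = textwrap.dedent("""
    import os, sqlite3, sys
    conn = sqlite3.connect(sys.argv[1], isolation_level=None)
    conn.execute("BEGIN")
    conn.executemany("INSERT INTO records (val) VALUES (?)",
                     [(i,) for i in range(1000)])
    os._exit(1)  # abrupt exit: no commit, no rollback, no cleanup
""")


def crash_test(db_path):
    """Return the row count seen after a writer dies mid-transaction.

    Hypothetical helper: simulates a process crash, not a power cut.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS records (val INTEGER)")
    conn.execute("INSERT INTO records (val) VALUES (-1)")  # committed baseline
    conn.commit()
    conn.close()

    # Run the writer in a separate process so its death cannot take us down.
    subprocess.run([sys.executable, "-c", CHILD, db_path])

    # Reopen and inspect: can we read it, and did uncommitted rows leak in?
    conn = sqlite3.connect(db_path)
    (count,) = conn.execute("SELECT COUNT(*) FROM records").fetchone()
    conn.close()
    return count
```

Run against a fresh file, `crash_test` should report only the committed baseline row. If the 1000 uncommitted rows (or a partial set of them) show up, the "was the database left unaltered?" question has a bad answer.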

It's just as important to test the client in these scenarios. While the server may have crashed, what does the client think happened? Was it waiting for an ACK or "OK" message? Did it get the message? If the update failed, what does the client do in that situation?
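
To make the client side concrete: the nasty case is when the server applied the write but the "OK" never arrived, so the client cannot tell success from failure. One common way to cope (an assumption on my part, not the only option) is to tag each request with an ID so that a retry is harmless. The sketch below uses a made-up in-memory `FakeServer` rather than a real database, and all names in it are hypothetical.

```python
import uuid


class LostAck(Exception):
    """The request may or may not have been applied; the reply never arrived."""


class FakeServer:
    """Hypothetical in-memory stand-in for a database server."""

    def __init__(self, drop_acks=0):
        self.rows = []
        self.seen = set()           # request ids already applied
        self.drop_acks = drop_acks  # number of replies to "lose"

    def write(self, request_id, value):
        if request_id not in self.seen:  # apply each request at most once
            self.rows.append(value)
            self.seen.add(request_id)
        if self.drop_acks > 0:           # simulate a cut after the write landed
            self.drop_acks -= 1
            raise LostAck(request_id)
        return "OK"


def write_with_retry(server, value, attempts=3):
    """Retry a write using the same request id on every attempt."""
    request_id = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return server.write(request_id, value)
        except LostAck:
            continue  # outcome unknown; retrying is safe only because of the id
    raise LostAck(request_id)
```

With `FakeServer(drop_acks=2)`, the first two attempts look like failures to the client even though the row was stored on the first one. Without the request ID check, the same value would be stored three times.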

Things can get even more complicated if you're thinking of replication across different servers. If one of the servers fails, how does the replication work? Do sessions fail over to other servers? How many servers are required? If there was a corrupted record, did it propagate or was it scrubbed?
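
On the "did it propagate or was it scrubbed?" question, one way to picture scrubbing is a checksum that travels with each record, which the replica verifies before applying it. This is a toy sketch with made-up names (`make_record`, `apply_to_replica`), not how any particular database does replication.

```python
import hashlib


def make_record(payload: bytes) -> dict:
    """Attach a SHA-256 checksum to the payload on the primary."""
    return {"payload": payload, "sha256": hashlib.sha256(payload).hexdigest()}


def apply_to_replica(replica: list, record: dict) -> bool:
    """Append the record only if its checksum still matches.

    Returns True if applied, False if the record was rejected as corrupt.
    """
    if hashlib.sha256(record["payload"]).hexdigest() != record["sha256"]:
        return False
    replica.append(record)
    return True
```

A test would flip a byte in `record["payload"]` after `make_record` and check that the replica refuses it instead of quietly copying the damage.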

Thank you for explaining so well and clearly!

To you and others, are there any other scenarios that happen in production?

Enough material here to scare anyone about databases

https://jepsen.io/talks

Some things I have seen in production:

The disk becomes inoperative during a write. This can be either silent, or writes start to return errors. Again, what does the database look like after the problem is solved?

A large operation exceeds the capacity of the server to deal with intermediate state. It runs out of memory or disk, or in some not-so-great DBs it loses control of some locks and gets deadlocked. Can it recover with only the partial log data?

Disks lie about data being written. What happens if one of these problems occurs between the disk saying the data was written and it actually getting written?
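
For context on the "disks lie" point, this is roughly what an application does when it wants a write to be durable: flush its own buffer, ask the OS to flush its cache with `os.fsync`, then rename into place. The helper name `durable_write` and the `.tmp` suffix are my own choices for illustration. Note that `fsync` is only a request: a drive that acknowledges from a volatile write cache can still lose the data, which is exactly the window described above.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path via a temp file, fsync, and rename (sketch)."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push its cache to the device
    os.replace(tmp, path)     # atomic rename on POSIX filesystems

    # On POSIX, also fsync the directory so the rename itself is persisted.
    if os.name == "posix":
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Everything here can succeed and the data can still vanish on power loss if the hardware reported completion early, so software alone cannot test this; you need the pull-the-plug kind of test.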

And, of course, when you move beyond a single server things get way more complex.

You'd be surprised by how often I've seen a database fail in prod simply because it ran out of disk space. In both cases monitoring software was running but misconfigured.
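
The check itself is tiny, which is what makes the misconfiguration so painful. A minimal sketch using Python's standard `shutil.disk_usage`; the function name and the 10% threshold are arbitrary assumptions, not a recommendation.

```python
import shutil


def disk_space_low(path: str, min_free_fraction: float = 0.10) -> bool:
    """Return True if the filesystem holding path has too little free space."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # unusual filesystem; treat as low to be safe
        return True
    return usage.free / usage.total < min_free_fraction
```

The hard part is making sure it points at the volume the database actually writes to, and that somebody gets paged when it returns True.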

We ship a consumer application with multiple databases in it. Even the ACID ones fail all the time; maybe 1-3% of our userbase has had a corruption at some point.

Can you talk about the reasons for the failures?