by zackmorris 4717 days ago
Man, I wrote two rather long responses and then resigned myself to the fact that you are right. This is a very difficult problem.

The only insight I gained was that since billing isn't a realtime task, availability shouldn't matter in this case. IMHO that's why what happened was considered a "mistake".

Some thoughts:

Ideally, they should be able to structure their database so that a large number of transactions can be queued (up to the limit of virtual memory) without having to keep a connection open for each one. Then they could get rid of the notion of timeouts, as long as the master is up a high enough percentage of the time to handle the load. If it isn't, then the clients themselves would eventually block once their queues fill up virtual memory, until the master catches up.
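
To make the "queue without timeouts" idea concrete, here is a minimal sketch, assuming a single in-process bounded queue stands in for a client's memory limit and a thread stands in for the master. The names `run_demo`, `pending`, and `master` are made up for illustration and are not from any real system.

```python
import queue
import threading


def run_demo(capacity=2, total=5):
    """Queue `total` transactions through a buffer that holds only `capacity`."""
    # Bounded queue: the capacity stands in for "the limit of virtual memory".
    pending = queue.Queue(maxsize=capacity)
    committed = []

    def master():
        # Hypothetical master: drains the queue in order until it sees the sentinel.
        while True:
            txn = pending.get()
            if txn is None:
                break
            committed.append(txn)

    worker = threading.Thread(target=master)
    worker.start()

    for i in range(total):
        # put() with no timeout blocks while the queue is full, so the
        # producer simply waits for the master to catch up instead of failing.
        pending.put(f"txn-{i}")

    pending.put(None)  # sentinel: no more work
    worker.join()
    return committed
```

The producer never sees a timeout here. When the buffer is full it just stalls, which is the blocking behavior described above.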

If they rewrite their code, they should start with the assumption that the master is normally down. They would have to solve the problem of two transactions conflicting on the client side. So for example, give each transaction a hash or digest that uniquely identifies it as a payment for a certain month. So if one client blocks for a day, and someone tries to start a payment on another client which also blocks, the master would resolve it when it comes back up by committing only one transaction for the hash and failing the other one.
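
A rough sketch of that conflict resolution, assuming the digest is a SHA-256 over a customer ID and billing month, and that the master processes queued transactions in arrival order. `intent_key`, `Master`, and `resolve` are hypothetical names used only for this example.

```python
import hashlib


def intent_key(customer_id: str, billing_month: str) -> str:
    """Digest that identifies a payment by its intent, not by who submitted it."""
    return hashlib.sha256(f"{customer_id}|{billing_month}".encode()).hexdigest()


class Master:
    """Hypothetical master that commits at most one transaction per intent key."""

    def __init__(self):
        self.committed = {}  # intent key -> (client, amount) that won

    def resolve(self, queued):
        """Process (client, key, amount) tuples queued while the master was down."""
        results = []
        for client, key, amount in queued:
            if key in self.committed:
                # Same intent already committed: fail this one.
                results.append((client, "failed-duplicate"))
            else:
                self.committed[key] = (client, amount)
                results.append((client, "committed"))
        return results
```

For example, if two clients both queue the March payment for the same customer while the master is down, the first one processed commits and the second fails, while a payment for a different month goes through untouched.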

If they get all that logic right, then they could fire off a payment and, once the client server says it's queued, move on without needing a receipt for the commit. Although I just realized that if the client server crashes before the master picks up the transaction, they have no guarantee. So they would have to start each transaction on 2 or more client servers, whichever number statistically ensures a high enough guarantee, or possibly all of the clients. That sounds like a lot of overhead at first until you realize that each transaction only takes a few KB of memory to represent.
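
As a back-of-the-envelope sketch of "whichever number statistically ensures a high enough guarantee": assuming each client server loses a queued transaction independently with probability `p_loss` (a strong assumption, since real failures are often correlated), the transaction survives on at least one of n servers with probability 1 - p_loss^n. The function name `replicas_needed` is made up for this example.

```python
def replicas_needed(p_loss: float, target: float) -> int:
    """Smallest n such that 1 - p_loss**n >= target.

    Assumes independent failures, which is an idealization.
    """
    if not 0 < p_loss < 1:
        raise ValueError("p_loss must be strictly between 0 and 1")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")

    n = 1
    all_lost = p_loss  # probability that every one of the n copies is lost
    while 1 - all_lost < target:
        n += 1
        all_lost *= p_loss
    return n
```

With a 50% loss chance per server and a 90% target, this gives 4 servers. With a 10% loss chance and a 95% target, 2 servers are enough.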

I think that possibly this points to a general solution. Give each transaction a unique ID based on its intent and context, and start it on enough client servers to provide a high enough statistical success rate. Make sure the clients store the pending transactions in nonvolatile memory so they can resume if the power goes out. I'm sure I've forgotten something but surely there is a way to solve this specific subset of problems so that companies could do a sanity check on themselves and avoid repeating history.
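
One way the "nonvolatile" part could look, as a minimal sketch: each client keeps its pending transactions in a local SQLite file keyed by the intent-based ID, so a restart can pick up where it left off and a repeated submission of the same ID is ignored. The class `PendingStore` and its schema are assumptions for illustration, not a description of any real billing system.

```python
import os
import sqlite3
import tempfile


class PendingStore:
    """Hypothetical on-disk queue of pending transactions for one client server."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "txn_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )
        self.db.commit()

    def enqueue(self, txn_id, payload):
        """Persist a transaction; returns False if this ID was already queued."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO pending (txn_id, payload) VALUES (?, ?)",
            (txn_id, payload),
        )
        self.db.commit()  # written to disk before we report it as queued
        return cur.rowcount == 1

    def pending(self):
        """All queued transactions in insertion order."""
        return self.db.execute(
            "SELECT txn_id, payload FROM pending ORDER BY rowid"
        ).fetchall()

    def close(self):
        self.db.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "pending.db")
    store = PendingStore(path)
    store.enqueue("cust-42:2012-03", "charge 1000")
    store.close()  # simulate a restart by closing and reopening the file
    print(PendingStore(path).pending())
```

The primary key on `txn_id` does the duplicate check locally; the master would still need its own check across clients, since two different client servers can each hold the same ID.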