| I posted this comment to the ACM article, but they're pre-moderating and apparently someone's shirking their responsibilities... so I'm reposting here. Hope you guys don't mind: Hey Michael, Interesting read. I'm with you on most points. I do have a couple comments though... Failure mode 1 is an application error. Failure mode 2 should result in a failed write. It shouldn't be too hard to trap errors programmatically and handle them intelligently / not propagate them. Of course the devil's in the details and hardware / interpreter / other problems in components that are outside of the DBs control can make things more difficult. These are the sorts of issues that tend to stick around until a system is "battle tested" and run in a couple of large / high volume operations. Failure modes 3, 4, 5, 6 (partition in a local cluster) - this seems to be where the meat of your argument is... but you totally gloss over your solution. I'm not sure I believe that network partitions (even single node failures) are easily survived by lots of algorithms... Or, more specifically, I don't believe that they can be survived while maintaining consistency (in the CAP sense, not the ACID sense). I threw together a super simplified "proof" of why consistency is essentially impossible in this situation in a recent talk. See http://www.slideshare.net/mmalone/scaling-gis-data-in-nonrel... - slides 16 through 20. What algorithms are there to get around this? If a replica is partitioned you either can't replicate to it and have to fail (unavailable) or you can't replicate to it and succeed anyways (replica is inconsistent). I also don't buy the argument that partitions (LAN or WAN) are rare and therefore we shouldn't worry about them. For a small operation this may be true, but when you're doing a million operations a second then a one-in-a-million failure scenario will happen every second. Failure mode 7 will probably result in some data loss unless (as you mention) you're willing to live with the latency of waiting for durable multi-datacenter writes to occur. But having that option is definitely nice, and that's a trade off that I'd like to be able to make on a per-write basis. I may choose to accept that latency when I'm recording a large financial transaction, for example. Another thought related to this issue - in a lot of ways writing something to memory on multiple nodes is more "durable" than writing it to disk on one. So you may be able to do multi-DC replicated writes in memory with tolerable latency assuming your DCs are close enough that the speed of light isn't limiting. That should get you durability up to the point where the entire eastern seaboard disappears, at least. Failure mode 8 is another core issue that I think you're glossing over. WAN failures (particularly short-lived ones) can and do happen on a regular basis. It's true that routing issues are typically resolved quickly, but it's another law-of-large-numbers thing. Amazon AWS had an issue that took an entire data center offline a while back, for example. Shit happens. In CAP terms this is really the same thing as a failure modes 3, 4, 5, 6, and 7 though. So the same arguments apply. Re: your argument that only a small segment splits - what happens when a read comes into the small split segment (maybe from a client in the same datacenter)? If data has been updated on the larger segment it couldn't have been replicated, so again you're either serving stale data or your data store is unavailable. Thanks for putting this together, it was an interesting read. Looking forward to hearing more about some of these issues! |