|
Hello ploxiln, to see systems with clocks jumping is indeed possible, I'll make it more clear in my blog post, but basically it is possible with some effort to prevent this in general, but more specifically the Martin suggestion of using the monotonic time API is a good one and was already planned. It's very hard to argue that the monotonic time jumps back and forth AFAIK, so I think it's a credible system model. The monotonic API surely avoids a lot of pitfalls that you can have with gettimeofday() and the sysadmin poking the computer time in one way or the other (ntpd or manually). About the ad-hoc logic, in this case I think the problem is actually that you want to accept only what already exists. For example there are important distributed systems papers where the assumption of absolute time bound error among processes is made, using GPS units. This is a much stronger system model (much more synchronous) compared to what I assume, yet this is regarded as acceptable. So you have to ask yourself if, regardless of the fact this idea is from me, if you think that a computer, using the monotonic clock API, can count relative time with a max percentage of error. Yes? Then it's a viable system model. About you linking to Aphyr, Sentinel was never designed in order to make Redis strongly consistent nor to to make it able to retain writes, I always analyzed the system for what it is: a failover solution that, mixed with the asynchronous replication semantics of Redis, has certain failure modes, that are considered to be fine for the intended use case of Redis. That is, it has just a few best-effort checks in order to try to minimize the data loss, and it has ways to ensure that only one configuration wins, eventually (so that there are no permanent split brain conditions after partitions heal) and nothing more. To link at this I'm not sure what sense it makes. If you believe Redlock is broken, just state it in a rigorous form, and I'll try to reply. If you ask me to be rigorous while I'm trying to make it clear with facts what is wrong about Martin analysis, you should do the same and just use arguments, not hand waving about me being lame at DS. Btw in my blog post I'll detail everything and show why I think Redlock is not affected by network unbound delays. |
"Redlock" is a fancy name, involves multiple nodes and some significant logic, less experienced developers will assume it's bulletproof.
"Redis Sentinel" is a fancy name, involves multiple nodes and some significant logic, less experienced developers will assume it's bulletproof.
I guess I'd prefer that people shoot themselves in the foot with their own automatic failover schemes, instead of using a publicized ready-to-use system which has marketing materials (name, website, spec, multiple implementations) which lead them to believe it is bulletproof, and these systems end up being used in for a bunch of general-purpose cases without specific consideration for their failure modes. In this case, the "marketing" is just the fact that you're behind it, and redis really is solid for what it is.
(Usually I avoid automatic promotion/fail-over altogether. It's often not a good idea. Instead I find some application-specific way of operating in a degraded state until a human operator can confirm a configuration change. So service is degraded for some minutes, but potential calamity of automatic systems doing wrong things with lots of customer data is averted.)
It's really not your responsibility to make sure the open source software you offer to the world for free is used appropriately. But people like me will be annoyed when it contributes to the over-complicated-and-not-working messes (not directly your fault) we have to clean up :)
One specific comment: http://redis.io/topics/distlock notes the random value used to lock/unlock to prevent unlocking a lock from another process. But at that point, the action just before the unlock could very well also have been done after the lock was acquired by another process. So while this is a good way to keep the lock state from being corrupted, it is really a side-note rather than a prominent feature, since if this happens, the data the lock was protecting could well have been corrupted.