|
|
|
|
|
by ploxiln
3784 days ago
|
|
The quality of your C code running on a single node is by all accounts really good. But you sometimes speak about distributed systems in a way that is not correct to people who have really spent a lot of time making these things work at scale. I've done distributed systems, and in fact I usually don't implement systems that are 100% correct under maximally-adverse conditions, for performance and convenience reasons. But it's important to know and acknowledge where and how the system is less than 100% bullet-proof, so that you can understand and recover from failures. When you have 1000 servers, some virtual in the cloud, some in a colocation facility, and maybe 10 or 20 subtle variants in hardware, you will get failures that should not happen but they do. For example, I've had an AWS host where the system time jumped backwards and forwards by 5 minutes every 30 seconds. And I've seen a tcp load-balancer appliance managed by a colo facility do amazingly wrong things. The point is, you need to understand where the consistency scheme in a distributed system is "fudged", "mostly reliable", "as long as conditions aren't insane", and where they are really guaranteed (which is almost nowhere). Then you can appropriately arrange for "virtually impossible" errors/failures to be detected and contained. This ad-hoc logic you bring to the problem of distributed systems doesn't really help. Not to "dogpile", but elucidate for others who may not know some of the history here: https://aphyr.com/posts/287-asynchronous-replication-with-fa...
and the also interesting reply: http://antirez.com/news/56 |
|
About the ad-hoc logic, in this case I think the problem is actually that you want to accept only what already exists. For example there are important distributed systems papers where the assumption of absolute time bound error among processes is made, using GPS units. This is a much stronger system model (much more synchronous) compared to what I assume, yet this is regarded as acceptable. So you have to ask yourself if, regardless of the fact this idea is from me, if you think that a computer, using the monotonic clock API, can count relative time with a max percentage of error. Yes? Then it's a viable system model.
About you linking to Aphyr, Sentinel was never designed in order to make Redis strongly consistent nor to to make it able to retain writes, I always analyzed the system for what it is: a failover solution that, mixed with the asynchronous replication semantics of Redis, has certain failure modes, that are considered to be fine for the intended use case of Redis. That is, it has just a few best-effort checks in order to try to minimize the data loss, and it has ways to ensure that only one configuration wins, eventually (so that there are no permanent split brain conditions after partitions heal) and nothing more.
To link at this I'm not sure what sense it makes. If you believe Redlock is broken, just state it in a rigorous form, and I'll try to reply.
If you ask me to be rigorous while I'm trying to make it clear with facts what is wrong about Martin analysis, you should do the same and just use arguments, not hand waving about me being lame at DS. Btw in my blog post I'll detail everything and show why I think Redlock is not affected by network unbound delays.