| Now a more philosoraptor style comment: I see Mcrib is a service built to
quickly detect and replace memcached instances. I treat memcached in infrastructure as
a very stable service. Meaning it is infrequently necessary to upgrade it, and
it will generally not fail on its own. If it does it will be highly infrequent
compared to services with higher churn or more complexity/dependencies. This
means if they're failing often enough that you need to rapidly detect and
replace them you have a more fundamental problem.

From a structural standpoint I think my technical comment can be useful. If
things really are failing this much, A) you should figure out why and slow that
down. B) If you have a generally stable system and understand the typical rate
of failure, you can add tripwires into Mcrib that stop it from over-culling
services and loudly raise alarms instead. Then C) you can improve technical reliability with
redundancy/extstore/etc.

I've also seen plenty of times where folks have a dependency of a service
determine if that service is usable, which I disagree with quite strongly.
Consul being down on a node should trigger something to check whether the
service is actually dead, not kill it outright. It's important both for reliability (don't kill perfectly
working things because you end up having to design around it), and for
maintainability as you've now made people afraid of upgrading Consul or other
co-dependent services. Other similar failures are single-point-of-testing
availability checking where instead you probably want two points of truth
before shooting a service.

Now you risk people being afraid of upgrading probably anything, which means
they will work around it, abstract it, or needlessly replace it with something
they feel safer managing. The latter is at best a waste of time, at worst a
time bomb until you find out what conditions this new thing breaks under.

This isn't advocating that you design without assuming anything can fail
anywhere at any time; just pointing out that how often a service _should_ fail
is extremely useful information when designing systems and designing
fail-safes, alerts, monitoring, etc. |
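
To make the tripwire and "two points of truth" ideas above concrete, here is a minimal sketch in Python. It is not how Mcrib works; `decide_culls`, `CullDecision`, the probe dictionaries, and the `max_culls_per_window` threshold are all hypothetical names invented for illustration. The assumption is that two independent health checkers each report node name -> healthy, and that more simultaneous "dead" nodes than you would ever expect means the detector is more likely wrong than the nodes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CullDecision:
    to_replace: list[str]  # nodes considered safe to replace automatically
    alarm: bool            # True when the tripwire fired and a human should look


def decide_culls(
    probe_a: dict[str, bool],
    probe_b: dict[str, bool],
    max_culls_per_window: int = 1,  # hypothetical threshold, tune to your real failure rate
) -> CullDecision:
    """Each probe maps node name -> True if that probe saw the node healthy."""
    # Two points of truth: a node is a candidate only if BOTH independent
    # probes say it is unhealthy. A node missing from either report is
    # treated as "unknown", not as dead.
    candidates = sorted(
        node
        for node in probe_a.keys() & probe_b.keys()
        if not probe_a[node] and not probe_b[node]
    )

    # Tripwire: if more nodes look dead than a stable service should ever
    # lose at once, replace nothing and raise an alarm instead.
    if len(candidates) > max_culls_per_window:
        return CullDecision(to_replace=[], alarm=True)

    return CullDecision(to_replace=candidates, alarm=False)


# One node down according to both probes: replace it quietly.
a = {"mc1": True, "mc2": False, "mc3": True}
b = {"mc1": True, "mc2": False, "mc3": False}  # only probe b dislikes mc3
print(decide_culls(a, b))

# Every node "down" at once: almost certainly a checker or network problem.
all_down = {"mc1": False, "mc2": False, "mc3": False}
print(decide_culls(all_down, all_down))
```

The design choice is that the failure mode of the checker is "do nothing and page someone", never "replace the whole tier".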
I run memcached at a large scale. You are totally right. Every other year we will find ONE bad memcached node down. We use nutcracker instead of mcrouter for consistent hashing to each memcached node. Once I read "We also run a control plane for the cache tier, called Mcrib. Mcrib’s role is to generate up-to-date Mcrouter configurations" -- I was like oooooh boy, here we go....
Knowing memcache is a rock comes with experience though.
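
As a rough worked example of why a known failure rate is useful: the sketch below takes "one bad node every other year" as a fleet-wide rate of 0.5 failures per year and asks how likely it is that two or more genuine, independent failures land in the same one-hour window. The Poisson model (independent failures at a constant rate), the one-hour window, and the helper name `prob_at_least` are assumptions made for illustration, not anything stated in the thread.

```python
import math


def prob_at_least(k: int, expected: float, terms: int = 60) -> float:
    """P(N >= k) for N ~ Poisson(expected), summing the tail terms directly.

    Summing the tail avoids the floating-point cancellation you get from
    computing 1 - P(N < k) when the answer is very small.
    """
    total = 0.0
    for n in range(k, k + terms):
        total += math.exp(-expected) * expected**n / math.factorial(n)
    return total


HOURS_PER_YEAR = 365 * 24
fleet_failures_per_year = 0.5  # "one bad node every other year"
window_hours = 1.0             # hypothetical tripwire window

expected_in_window = fleet_failures_per_year * window_hours / HOURS_PER_YEAR
p_two_or_more = prob_at_least(2, expected_in_window)

print(f"expected failures per window: {expected_in_window:.2e}")
print(f"P(>= 2 genuine failures in one window): {p_two_or_more:.2e}")
```

Under these assumptions the probability comes out well under one in a hundred million, so if a checker reports two dead nodes in the same hour, a correlated cause or a broken check is far more plausible than coincidence.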