Hacker News
by dormando 1514 days ago
Now a more philosoraptor style comment: I see Mcrib is a service built to quickly detect and replace memcached instances. I treat memcached in infrastructure as a very stable service. Meaning it is infrequently necessary to upgrade it, and it will generally not fail on its own. If it does, it will be highly infrequent compared to services with higher churn or more complexity/dependencies. This means if they're failing often enough that you need to rapidly detect and replace them, you have a more fundamental problem.

From a structural standpoint I think my technical comment can be useful. If things really are failing this much A) you should figure out why and slow that down. B) if you have a generally stable system and understand the typical rate of failure, you can add tripwires into Mcrib to avoid over-culling services and loudly raise alarms. Then C) you can improve technical reliability with redundancy/extstore/etc.

I've also seen plenty of times where folks have a dependency of a service determine if that service is usable, which I disagree with quite strongly. Consul being down on a node should trigger something to check whether the service is actually dead, not be treated as proof that it is. It's important both for reliability (don't kill perfectly working things because you end up having to design around it), and for maintainability, as you've now made people afraid of upgrading Consul or other co-dependent services. Other similar failures are single-point-of-testing availability checking, where instead you probably want two points of truth before shooting a service.
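
To make the "two points of truth" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical (`ProbeResult`, `judge`, the vantage names, the threshold of two); it is not how Mcrib or Consul work. It only shows a rule where a service is declared dead when at least two distinct vantage points fail to reach it and none succeed.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"  # keep serving, raise an alert for a human
    DEAD = "dead"        # safe to consider replacement


@dataclass(frozen=True)
class ProbeResult:
    vantage: str     # hypothetical name of where the probe ran from
    reachable: bool  # could this vantage point talk to the service itself?


def judge(results: list[ProbeResult], required_failures: int = 2) -> Verdict:
    """Declare DEAD only when enough distinct vantage points agree."""
    if not results:
        # No data is not evidence of death.
        return Verdict.SUSPECT
    failed = {r.vantage for r in results if not r.reachable}
    succeeded = {r.vantage for r in results if r.reachable}
    if not failed:
        return Verdict.HEALTHY
    if len(failed) >= required_failures and not succeeded:
        return Verdict.DEAD
    # A single failing observer, or observers that disagree, is only a warning.
    return Verdict.SUSPECT
```

Repeated failures from the same vantage point count once, so one broken checker cannot shoot a service by itself.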

Now you risk people being afraid of upgrading probably anything, which means they will work around it, abstract it, or needlessly replace it with something they feel safer managing. The latter is at best a waste of time, at worst a time bomb until you find out what conditions this new thing breaks under.

This isn't advocating that you design without assuming anything can fail anywhere at any time; just pointing out that how often a service _should_ fail is extremely useful information when designing systems and designing fail safes, alerts, monitoring, etc.

"I treat memcached in infrastructure as a very stable service."

I run memcached at a large scale. You are totally right. Every other year we will find ONE bad memcached node down. We use nutcracker instead of mcrouter for consistent hashing to each memcached node. Once I read "We also run a control plane for the cache tier, called Mcrib. Mcrib’s role is to generate up-to-date Mcrouter configurations" -- I was like oooooh boy, here we go....

Knowing memcache is a rock comes with experience though.

Our underlying hardware (AWS) is nothing like this reliable. We see regular (several times a year) failure of racks of machines or whole DCs.

Across the whole fleet (all services), we lose 1-10 servers per day as a baseline. Major events are then on top of that and can impact thousands of hosts at once.

What service is this?? This must be huge.
> I run memcached at a large scale

I don't believe you run it at the scale Slack does.

The people at Slack who decided to use Mcrouter (and created Mcrib) have experience running Memcached, Mcrouter and Nutcracker in production at two of the biggest web properties in the world.

Trust that they know whereof they speak.

You may not be wrong, in fact you are very likely right, but this is not an argument.

The larger an org gets, the more likely it is to do weird things to mitigate organizational difficulties, be they budget, human, or otherwise.

Those types of things rarely show up in postmortems for obvious reasons.

"I don't believe you run it at the scale Slack does."

Definitely not. We host about 80% of elementary schools in the US. Not Slack scale, but we definitely face many of the same issues :/

I think you nailed the real issue that caused the incident: saying "consul down == unhealthy memcached", then evicting the node. If Mcrib instead did some actual application-level health checks (e.g. a memcached ping), which could be correlated with some system metrics (CPU, RAM), it could avoid evicting those perfectly good nodes with a warm cache that just happen to have a restarting Consul agent.
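
As a rough illustration of what an application-level check could look like, here is a sketch in Python. It assumes memcached's text protocol, where the `version` command gets a `VERSION ...` reply. The function names, the action strings, and the `agent_up` flag are made up for the example and say nothing about how Mcrib is actually built.

```python
import socket


def memcached_ping(host: str, port: int = 11211, timeout: float = 1.0) -> bool:
    """Return True if the server answers the text-protocol `version` command."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"version\r\n")
            reply = b""
            # Read until the end of the first line (or a small size cap).
            while not reply.endswith(b"\r\n") and len(reply) < 128:
                chunk = sock.recv(128)
                if not chunk:
                    break
                reply += chunk
    except OSError:
        # Refused, unreachable, or timed out: memcached itself did not answer.
        return False
    return reply.startswith(b"VERSION ")


def decide(agent_up: bool, ping_ok: bool) -> str:
    """Hypothetical policy: only memcached's own answer can justify eviction."""
    if ping_ok:
        # A dead sidecar agent is worth an alert, not a cold cache.
        return "keep" if agent_up else "keep_and_alert"
    return "evict_candidate"
```

With this policy, a restarting agent on a node whose memcached still answers yields `keep_and_alert`, so the warm cache survives and a human still hears about the agent.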

Granted, this is easy to say once the incident happened with an excellent postmortem, but this should be an industry-wide wakeup call: don't do this.

I have the same issue at work, where people treat a "prometheus node_exporter down" as a "the app on the machine is down". I've started to add the actual app name in our alerts, and now people don't freak out anymore when they see "down" alerts: oh node_exporter is down, but not the app? Don't panic and calmly check why.

It’s likely that the memcached install is so large that the underlying instances themselves are failing. When you have hundreds or thousands of instances, failures in the instances themselves become pretty regular.
I don't see this. I have thousands of long-lived instances - full VMs, not containers, running on our hardware.

If they start "going bad", something is wrong. That's a signal I wouldn't want to ignore.

It has happened: once an HBA in a storage node was causing occasional corruption; another time, due to a communication failure, people were building things with the wrong version of something that had a memory leak and would eventually summon the OOM killer. There have been other issues.

"Have you tried turning it off and back on again" is still a terrible system management strategy.

Failure rates in AWS are probably higher than what you're seeing in your own hardware.
Maybe. If you don't look, you don't know.

But given the number of people I've heard using "we're on AWS, out of my control" as an excuse, this appears to be an unofficial service they offer.

I can say with certainty this isn't strictly true. The failures should be relatively rare; when I say relatively I mean on the level of natural node failure. If natural node failure isn't survivable without special systems to quickly replace downed nodes you don't actually have an N+1 redundancy system. Thus, the pools aren't large enough :) Or, in this case, if they really are failing this much then having them always lose their cache is a major reliability hole.

It's a subtle difference. I think many operators get used to node failures being extremely common when they don't necessarily have to be. I suspect the note on "if they come back on their own ensure they're flushed" means they have something unusual causing ephemeral failures. If that's just "cloud networking" there isn't much they can do, but it's almost always fixable.

> The failures should be relatively rare; when I say relatively I mean on the level of natural node failure.

And exactly how rare do you believe this to be?

In my experience, node failures at a scale of hundreds to thousands of nodes are monthly to weekly, if not daily. Generally speaking, stability is a normal distribution. Young, new instances experience failure rates similar to those of old instances. If you have any sort of maximum node lifetime (for example, a week) or scale dynamically on a daily basis, then you'll see a lot of failures.

Which still means you could implement a hard limit of 1 failure per hour and only allow more replacements with manual intervention. With a thousand nodes, several or even hundreds failing within a few hours is so unlikely that you're probably better off preventing automatic failover in these cases.
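
A hard limit like that is small enough to sketch. The Python below is only an illustration under assumptions: the class name, the default of 1 replacement per 3600 seconds, and the latch-until-reset behavior are picked to match the description above, not taken from any real control plane.

```python
import time
from collections import deque


class ReplacementBudget:
    """Allow at most `limit` automatic replacements per `window_s` seconds.

    Once the limit is exceeded the budget latches shut and stays shut
    until a human calls manual_reset().
    """

    def __init__(self, limit: int = 1, window_s: float = 3600.0,
                 clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock          # injectable so the logic can be tested
        self._granted = deque()     # timestamps of replacements we allowed
        self.tripped = False

    def request(self) -> bool:
        """Return True if one more automatic replacement may proceed."""
        now = self.clock()
        # Forget replacements that have aged out of the window.
        while self._granted and now - self._granted[0] >= self.window_s:
            self._granted.popleft()
        if self.tripped or len(self._granted) >= self.limit:
            self.tripped = True     # this is where a loud alarm would go
            return False
        self._granted.append(now)
        return True

    def manual_reset(self) -> None:
        """Operator acknowledges the alarm and re-enables automation."""
        self.tripped = False
        self._granted.clear()
```

The latch is the point: after the tripwire fires, waiting out the window is not enough, so a burst of "failures" caused by a bad detector cannot quietly drain the pool.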

But that generally mirrors my experience that automatic failover for stable software tends to cause more issues than it solves. A good (i.e. redundant hardware and software) PostgreSQL server is also so unlikely to fail that wrong detection and cascading issues from automatic failover are more likely than its actual benefits.

I think you're looking at it the wrong way. A server is never just postgres or memcached, there's always other stuff running, and it's that other stuff that can cause problems. Like maybe you're patching the fleet and a node fails to come back up, or due to misconfiguration the disk gets full.

I'd argue that stable systems are actually worse for operational stability as you become complacent and comfortable and when shit hits the fan you are unprepared.

More likely, they are using "spot instances" for memcached, which will cause them to be evicted fairly frequently.
Or horizontal autoscaling based on demand.