by mlthoughts2018 2229 days ago
I think there is more to the story for some of these points, and it can be dangerous to just take them at face value as best practices.

For example on the liveness / readiness probe item, the article says,

> “ The other one is to tell if during a pod's life the pod becomes too hot handling too much traffic (or an expensive computation) so that we don't send her more work to do and let her cool down, then the readiness probe succeeds and we start sending in more traffic again.”

But this is often a very bad idea and masks long-term underprovisioning of a service.

If readiness / liveness checks ever contend with real traffic badly enough to cause congestion, you need the failing checks to surface it so you can increase resources. If you set things up so this failure won’t surface, like letting the readiness check take that pod out of service until the congestion subsides, you’re only hurting yourself by masking the issue. It basically means your readiness check acts like a latency exception handler outside the application, which is a very bad idea.

The other item that is way more complicated than it seems is the issue about IAM roles / service accounts instead of single shared credentials.

In cases where your company has an enterprise security team that creates extremely low-friction tools to generate service account credentials and inject them, then sure, I would agree it’s a best practice to ruthlessly split up each application’s credentials to a shared resource, so you can isolate access and revocation.

But if you are on some application team and your company doesn’t have a mature enough security tooling setup managed by a separate security team, this can become a bad idea.

It can lead to superlinear growth in secrets management as there will be manual service account creation and credential propagation overhead for every separate application. Non-security engineers will store things in a password manager, copy/paste into some CI/CD tool, embed credentials as ENV permanently in a container, etc., all because they can’t create and maintain the end to end service account credential tools in addition to their job as an application team engineer. It’s something they think about twice per year and need off their plate immediately to move on to other work.

Across teams it means you end up with 20 different team-specific ways to cope with rapid growth of service accounts, leading to an even worse security surface area, risk of credential-based outages, omission of important testing because ensuring ability to impersonate the right service account at the right place is too hard, etc.

Very often it is a real trade-off to consider that one single service account credential that has just one way to be injected for every service is safer in the bigger picture.

Yes, it means a credential issue for any service becomes an issue for all. That is a risk, and you want automated tooling to mitigate it. But it will very often be less of a risk than insisting on a parochial best practice of individual service account credentials, which results in much worse and less auditable secrets workflows overall, unless the whole thing is owned and operated by a central security team in a way that doesn’t create any approval delays or workflow friction for application teams.


You of course should monitor the rate of liveness flapping for your services. The need to monitor it does not imply that it's a bad feature.

You can’t have it both ways. If you need to monitor it and take corrective action (which you do) then you shouldn’t rely on it.

This is an argument for making your liveness probe == readiness probe. It should just check pod availability in a minimal way, and if continuing to send the pod traffic based on this indicator turns out bad because of congestion, you want to see that causing errors and react, not let Kubernetes take it out of service for new traffic.

You want liveness & readiness to check the same thing, and it should be a non-trivial check of service health that is also very low latency. And as long as that check is passing, keep sending traffic.

When the check fails, it should always be for a “hard down” reason that tells you the pod could not, regardless of traffic levels, accept traffic because it’s fundamentally internally down.
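To make that concrete, here is a rough sketch of the kind of check I mean, in Python using only the standard library. The names (`HealthState`, `mark_hard_down`, the `/healthz` path, port 8080) are made up for illustration and are not from the article or any framework. The point is just that the endpoint knows nothing about load or latency: it only fails once the service has declared itself fundamentally down.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthState:
    """Tracks only 'hard down' conditions; deliberately ignores load."""

    def __init__(self):
        self._lock = threading.Lock()
        self._hard_down_reason = None  # None means the service is healthy

    def mark_hard_down(self, reason):
        # Called by the application when it hits an unrecoverable
        # internal failure (hypothetical hook, wire it up as needed).
        with self._lock:
            self._hard_down_reason = reason

    def check(self):
        """Return (http_status, body) for the probe response."""
        with self._lock:
            if self._hard_down_reason is None:
                return 200, "ok"
            return 503, "hard down: " + self._hard_down_reason


def make_handler(state):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            status, body = state.check()
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, fmt, *args):
            pass  # keep probe traffic out of the request log

    return Handler


def serve(state, port=8080):
    # Blocks forever; run it in a background thread next to the real app.
    ThreadingHTTPServer(("0.0.0.0", port), make_handler(state)).serve_forever()
```

You would point both the liveness and the readiness probe at the same `/healthz` path. Since nothing in `check()` looks at queue depth or response times, a pod that is merely busy keeps passing, and the congestion shows up as real errors you can act on.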

I don't want the pager to go off just because of some slight non-liveness. That's a likely outcome of high utilization (usually viewed as a good thing, isomorphic with low cost). If you're running really hot and a few tasks are shedding load by playing dead intermittently, that's OK up to a point; if a large portion of pods are doing that at a high rate, that might be bad. You might not even alert on it, just throw it up on a dashboard as an informative indicator for operators.
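Roughly the policy I have in mind, as a Python sketch. The function name and both thresholds are invented for illustration; they are assumptions, not recommendations, and you would tune them per service.

```python
def flapping_action(flapping_pods, total_pods, dashboard_at=0.05, page_at=0.30):
    """Decide what to do about pods that recently failed a probe.

    Returns "ignore", "dashboard", or "page" based on the share of
    pods that are flapping. Thresholds are illustrative assumptions.
    """
    if total_pods <= 0:
        raise ValueError("total_pods must be positive")
    if not 0 <= flapping_pods <= total_pods:
        raise ValueError("flapping_pods must be between 0 and total_pods")

    share = flapping_pods / total_pods
    if share >= page_at:
        return "page"       # a large portion is playing dead: wake someone up
    if share >= dashboard_at:
        return "dashboard"  # worth showing to operators, not worth a page
    return "ignore"         # a few hot pods shedding load
```

So one pod out of a hundred flapping is ignored, ten lands on the dashboard, and forty pages someone.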
> “I don't want the pager to go off just because of some slight non-liveness.”

That’s just bad engineering. Really, one should want the pager to go off for that and be really pedantic to actually sniff out the root cause and actually fix it.

Hiding that type of issue by letting something like liveness/readiness policy tacitly conceal it is just going to result in a far worse or more systemic issue later with far worse pager disruptions to your life.

You’re skipping flossing every now and then only to need serious root canals later.