| HN Mirror



	by residualmind 1225 days ago
	One thing I've noticed that is sometimes forgotten, especially at earlier stages, is monitoring. You want to know how much self-healing is actually happening. Let's say you have your self-healing system in place, say some k8s pods combined in a service with a little redundancy and very little state. Pods happily crash, another one takes over while a new one spins up. All is wonderful and you don't worry about your availability anymore because everything just always works. One day you decide to look into what's happening in your containers and are shocked because one pod crashes every 0.3 seconds. It just spins up, answers one request, then dies, and a new one spins up... continuously. From the outside everything looks kind of OK, but in reality you are wasting massive resources and have a nasty bug that might even be losing you data, breaking consistency, creating load, etc. Some sort of monitoring is a good idea, is what I'm saying.
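To make the "how much self-healing is actually happening" point concrete, here is a minimal sketch of a restart-rate check. The `RestartMonitor` class, its method names, and the thresholds are all hypothetical; in a real cluster the restart timestamps would have to come from somewhere like the `restartCount` field in each pod's container status, which this sketch does not query.

```python
from collections import deque


class RestartMonitor:
    """Hypothetical helper: flags a crash loop when too many restarts
    fall inside a sliding time window."""

    def __init__(self, window_seconds, max_restarts):
        self.window_seconds = window_seconds  # assumed window length
        self.max_restarts = max_restarts      # assumed tolerated restarts per window
        self._restarts = deque()              # restart timestamps, oldest first

    def record_restart(self, timestamp):
        """Record one observed pod restart (timestamps must be non-decreasing)."""
        self._restarts.append(timestamp)

    def restarts_in_window(self, now):
        """Count restarts no older than window_seconds relative to now."""
        # Drop restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts)

    def is_crash_looping(self, now):
        """True when the window holds more restarts than we tolerate."""
        return self.restarts_in_window(now) > self.max_restarts


# Simulate the pod from the comment: one crash every 0.3 seconds for 3 seconds.
monitor = RestartMonitor(window_seconds=60.0, max_restarts=3)
for i in range(10):
    monitor.record_restart(i * 0.3)

if monitor.is_crash_looping(now=3.0):
    print("pod is crash looping; the service only looks healthy from outside")
```

The point of the window is that a single restart now and then is exactly what self-healing is for, while a sustained rate is the bug hiding behind it.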

4 comments

davewritescode 1225 days ago

Monitoring is super important++

But the nice thing about using an already resilient system like k8s is that pod crashes won't stop your customers from working, and you can fix the issue in the background instead of having to throw up a status page and fix the problem immediately.

It's better to have a problem that your customers don't notice because it buys you time to figure out the issue.


codeduck 1225 days ago

Nobody ever cares about monitoring, until they need it. Then the tears flow deep and salty.


funcDropShadow 1224 days ago

That is one of the reasons why Brendan Gregg's USE [1] methodology is so great. USE stands for utilisation, saturation, errors. For every component, resource, or subsystem you should have at least one metric for each of these. Utilisation tells you how much it is used. Saturation tells you how close it is to its capacity limit, or how much it slows down under load. Errors tell you, e.g., when k8s pods restart all the time.

[1]: https://www.brendangregg.com/usemethod.html
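
As a rough illustration of "one metric per letter, per component", here is a small sketch. The `UseSample` type, the `use_findings` function, and the default thresholds are all made up for this example and are not taken from the USE page itself; where the numbers come from is left open.

```python
from dataclasses import dataclass


@dataclass
class UseSample:
    """Hypothetical record: one USE reading for one component."""
    component: str
    utilisation: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # queued or waiting work, e.g. a queue length
    errors: int         # error events in the interval, e.g. pod restarts


def use_findings(sample, max_utilisation=0.8, max_saturation=0.0, max_errors=0):
    """Return one message per USE dimension that exceeds its (assumed) limit."""
    findings = []
    if sample.utilisation > max_utilisation:
        findings.append(f"{sample.component}: utilisation at {sample.utilisation:.0%}")
    if sample.saturation > max_saturation:
        findings.append(f"{sample.component}: saturation at {sample.saturation}")
    if sample.errors > max_errors:
        findings.append(f"{sample.component}: errors at {sample.errors}")
    return findings


# A service that looks fine on utilisation and saturation but restarts constantly.
sample = UseSample(component="api-pods", utilisation=0.35, saturation=0.0, errors=200)
for finding in use_findings(sample):
    print(finding)
```

The crash-looping service from the top comment would pass the first two checks and only show up in the third, which is why having all three per component matters.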


andy_ppp 1225 days ago

What do you recommend I use to monitor my software then? Is there a good service I should use? Inside and outside the datacenter/AWS? What metrics should I monitor on PostgreSQL? Hacking attempts? There's a lot to consider.
