
You are right, I did somehow miss or dismiss the two sentences that describe the already assumed problem once we started talking about tcp_mem hitting a max.

To me there was a ton of fanfare about the problem and then two sentences about the problem and solution. Stopping and starting services is one of the worst ways to find out which service is using a resource. One of the hardest parts of troubleshooting a problem like this is finding the process in the bad state, and a restart throws that state away. Once you find the problem process you have a wealth of tools provided by Linux to collect a ton of data that will help ensure you actually solve the problem once you change code. This is what I expected from a blog post titled as such.
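To make that concrete, here is a rough sketch of one way to find the suspect process without restarting anything: walk `/proc` and count how many file descriptors in each process point at sockets. The function name `count_socket_fds` and the `proc_root` parameter are made up for illustration, and this assumes a Linux host where you have permission to read other processes' `fd` directories (typically root).

```python
import os
from collections import Counter


def count_socket_fds(proc_root="/proc"):
    """Return a Counter mapping pid -> number of open socket descriptors.

    proc_root is a parameter only so the function can be pointed at a
    fake directory tree for testing; on a real node it is /proc.
    """
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory (e.g. "self", "net")
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we are not allowed to look
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were scanning
            # Socket descriptors show up as links like "socket:[12345]".
            if target.startswith("socket:["):
                counts[int(entry)] += 1
    return counts
```

Calling `count_socket_fds().most_common(10)` on the affected node lists the ten processes holding the most sockets. A process whose count only ever grows is the one to attach further tooling to.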

Had Hasura not been mentioned at all I would likely have passed on commenting -- and probably should have anyway. But there is something that gets my goat about blog posts that are full of text with nothing really all that insightful, only to find out at the end that it was an advertisement.

Let me see if I can illustrate my point better.

* 311 words describing the problem and talking about Hasura.

* 81 words describing kubelet.

* 75 words of epiphany that containers run on Linux.

* 54 words of twittering.

* 52 words talking about the actual problem.

* 43 words on a feature request.

* 41 words plug for Hasura.

While writing this other replies have come in so I will try to address them here as well.

> He says it was a user space program not closing sockets.

He does say this, but he never explains how he came to that conclusion or how he proved it before making a code change. The tools are there to show this as proof, at least something more than killing a service and seeing a number drop. A blog post like this should leave the reader with a full understanding of how and why. There may have been a story in the edge case -- but we don't know, because no real debugging efforts were reported by this blog post -- asking on Twitter does not count.
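As a sketch of the kind of evidence I mean: sockets that the peer has closed but the local program has not sit in the CLOSE_WAIT state, so a large and growing CLOSE_WAIT count is a strong hint that something in user space is not calling close. The snippet below tallies TCP states from the text of `/proc/net/tcp`. The function name `count_tcp_states` is hypothetical, the state codes are the kernel's hex values for the `st` column, and this only covers IPv4 sockets in the network namespace you read the file from.

```python
from collections import Counter

# Hex codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Tally sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        code = fields[3].upper()
        # Fall back to the raw code if the kernel reports one we don't map.
        counts[TCP_STATES.get(code, code)] += 1
    return counts
```

On a node you would pass in `open("/proc/net/tcp").read()` and watch `counts["CLOSE_WAIT"]` over time. The inode column of the same file can be matched against the `socket:[inode]` links under `/proc/<pid>/fd` to tie those sockets to a specific process.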

> Hm...how do you propose kubernetes / kubernetes users solve these kinds of problems? It could be a fairly common error that’s hard to catch on a system of large number of nodes where you’re not supposed to actively think about the fact that you have nodes. What’s the right tooling / monitoring to have on a system of 20 nodes where one node is basically screwed?

The problem is thinking you can absolve yourself of all system administration tasks because you use kube or some other container-based system.

Ideally you would have built health monitoring into your original application and you would have spotted the issue in your own dashboards long before you exhausted node-level resources.
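For the node side of that, a minimal sketch of a check you could feed into whatever dashboard you already have: compare the TCP `mem` figure from `/proc/net/sockstat` against the high threshold in `/proc/sys/net/ipv4/tcp_mem` (both are counted in pages). The function names and the idea of alerting on a ratio are my own assumptions, not something from the blog post, and this watches node-level counters rather than anything inside the application.

```python
def parse_sockstat_tcp(sockstat_text):
    """Return the key/value pairs from the 'TCP:' line of /proc/net/sockstat."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            parts = line.split()[1:]  # e.g. ["inuse", "5", "orphan", "0", ...]
            return {parts[i]: int(parts[i + 1])
                    for i in range(0, len(parts) - 1, 2)}
    raise ValueError("no TCP line found in sockstat text")


def tcp_mem_ratio(sockstat_text, tcp_mem_text):
    """Fraction of the tcp_mem high threshold currently in use.

    tcp_mem_text is the contents of /proc/sys/net/ipv4/tcp_mem, which
    holds three numbers: low, pressure and high, all in pages.
    """
    low, pressure, high = (int(x) for x in tcp_mem_text.split())
    used_pages = parse_sockstat_tcp(sockstat_text)["mem"]
    return used_pages / high
```

Exporting that ratio per node and alerting well below 1.0 (the exact threshold is a judgment call) would flag the one bad node out of 20 before it stops accepting connections.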

I will admit I may have been harsh, and am still being harsh, but that is because this at face value looks like an attempt to promote a product via a blog post. Many companies push or even require this sort of thing. As far as quality goes, the blog posts like this that reach the front page of HN are normally chock-full of very useful information and dive deep into the why and how. If I were a blog reviewer I would say this post left me dry and wanting more details, and the content did not warrant an entire blog post to be trumpeted around. It simply described the mundane day-to-day work that a junior developer is expected -- required -- to do. If anything I am sad that the developer had a golden chance to learn some really neat stuff about Linux and Twitter was his chosen resource :(