Debugging a TCP socket leak in a Kubernetes cluster


	Debugging a TCP socket leak in a Kubernetes cluster (blog.hasura.io)
	61 points by alberteinstein 2983 days ago

3 comments

mbrumlow 2983 days ago

I have read this nearly four times now. I can't really find any substance -- it appears to be an advertisement for a product wrapped in an unsolved issue.

No mention of lsof, netstat, or tcpdump, the normal tools used for troubleshooting these sorts of problems. Without trying to sound too snarky, I find it highly concerning that the industry is now working with tools like Docker and Kubernetes and we somehow just throw out the fact that these sit on top of Linux.

Not to mention, kubelet's ability to spot one of many tunables reaching a max still would not have solved this problem -- "Fundamentally, the node was unhealthy" -- is not a proper answer to the problem -- what was done to resolve the memory issue is. That could be increasing tcp_mem to support the workload, or finding a faulty user-space program -- all of which we have no clue about because no real tools for troubleshooting this were used.
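For readers who want to check the tunable mentioned here, a minimal sketch follows. It assumes a Linux node where `/proc/net/sockstat` reports TCP memory in pages on its `TCP:` line and `/proc/sys/net/ipv4/tcp_mem` holds the three low/pressure/high thresholds, also in pages. The function names are hypothetical and this is not what the blog author ran.

```python
from pathlib import Path


def parse_tcp_mem(text):
    """Parse the three tcp_mem thresholds (low, pressure, high), in pages."""
    low, pressure, high = (int(x) for x in text.split())
    return {"low": low, "pressure": pressure, "high": high}


def parse_sockstat_tcp(text):
    """Return the key/value pairs on the 'TCP:' line of sockstat output."""
    for line in text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]
            return {k: int(v) for k, v in zip(fields[0::2], fields[1::2])}
    raise ValueError("no TCP line found in sockstat text")


def tcp_mem_usage(sockstat_text, tcp_mem_text):
    """Return (pages in use, thresholds, fraction of the high threshold)."""
    limits = parse_tcp_mem(tcp_mem_text)
    used = parse_sockstat_tcp(sockstat_text)["mem"]
    return used, limits, used / limits["high"]


def read_node_tcp_mem_usage(proc_root="/proc"):
    """Read the live values from a Linux node (paths assumed as described)."""
    root = Path(proc_root)
    return tcp_mem_usage(
        (root / "net/sockstat").read_text(),
        (root / "sys/net/ipv4/tcp_mem").read_text(),
    )
```

Calling `read_node_tcp_mem_usage()` on the node itself and alerting when the returned fraction approaches 1.0 would be one way to notice the condition before connections start failing.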

I mainly write this gripe because this appears to be a problemtisement, or a blogtisement. A "helpful" but not informative blog post that exists simply to advertise your company's service in the final blurb, leaving us with no real solution, resolution or a closing to the mystery of why tcp_mem was higher than expected.


kemcho 2983 days ago

> kubelet's ability to spot one of many tunables reaching a max...

Hm...how do you propose kubernetes / kubernetes users solve these kinds of problems? It could be a fairly common error that’s hard to catch on a system with a large number of nodes where you’re not supposed to actively think about the fact that you have nodes. What’s the right tooling / monitoring to have on a system of 20 nodes where one node is basically screwed?

These kinds of things make me think the entire K8s/container abstraction is just broken.


pas 2983 days ago

Shouldn't this be a Linux (per process or per namespace) feature to limit resources available to userspace?


ecthiender 2983 days ago

I am not sure how you have read this 4 times and missed these parts.

> leaving us with no real solution, resolution or a closing to the mystery of why tcp_mem was higher than expected

One user-space program was faulty and was not closing TCP sockets.

> what was done to resolve the memory issue is

The faulty program was fixed.

> Without trying to sound too snarky, I find it highly concerning that the industry is now working with tools like Docker and Kubernetes and we somehow just throw out the fact that these sit on top of Linux.

This I agree with, and this was the learning of the author, which they mention in the article.

Disclaimer: I work at Hasura


mbrumlow 2983 days ago

You are right, I did somehow miss or dismiss the two sentences that describe the already assumed problem once we started talking about tcp_mem hitting a max.

To me there was a ton of fanfare about the problem and then 2 sentences about the problem and solution. Stopping and starting services is one of the worst ways to find out which service is using a resource. This is because one of the hardest parts of troubleshooting a problem like this is finding the process in the bad state. Once you find the problem process you have a wealth of tools provided by Linux to collect a ton of data that will help ensure you actually solve the problem once you change code. This is what I expected from a blog post titled as such.
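As an illustration of finding the process in the bad state without restarting anything, here is a rough sketch that ranks processes by how many socket file descriptors they hold. It assumes a Linux `/proc` where each entry under `/proc/<pid>/fd` is a symlink and sockets appear as `socket:[inode]`. It needs enough privilege to read other processes' fd directories, and inside a container it only sees that container's PID namespace. The function names are made up for this example.

```python
import os


def count_socket_fds(link_targets):
    """Count fd link targets that look like 'socket:[12345]'."""
    return sum(1 for target in link_targets if target.startswith("socket:["))


def fd_targets(pid, proc_root="/proc"):
    """Return the symlink targets of every fd a process holds."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = []
    try:
        names = os.listdir(fd_dir)
    except OSError:
        # Process exited or we lack permission; skip it.
        return targets
    for name in names:
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # fd was closed between listdir and readlink
    return targets


def top_socket_holders(proc_root="/proc", limit=5):
    """Return [(pid, socket_count), ...] sorted by socket count, descending."""
    counts = {}
    for entry in os.listdir(proc_root):
        if entry.isdigit():
            n = count_socket_fds(fd_targets(entry, proc_root))
            if n:
                counts[int(entry)] = n
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]
```

Running `top_socket_holders()` on the affected node a few minutes apart and comparing the counts would show which PID keeps growing, which is the kind of evidence worth having before changing code.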

Had Hasura not been mentioned at all I would have likely passed on my comments -- and probably should have anyway. But there is something that gets my goat about blog posts that are full of text with nothing really all that insightful, only to find out it was an advertisement in the end.

Let me see if I can illustrate my point better.

* 311 words describing the problem and talking about Hasura.

* 81 words describing kubelet.

* 75 words of epiphany that containers run on Linux.

* 54 words of twittering.

* 52 words talking about the actual problem.

* 43 words on a feature request.

* 41 words plug for Hasura.

While writing this other replies have come in so I will try to address them here as well.

> He says it was a user space program not closing sockets.

He does say this but says nothing about how he came to the conclusion or how he proved it before making a code change. The tools are there to show this as proof, at least something more than killing a service and seeing a number drop. A blog post like this should leave the reader with a full understanding of how and why. There may have been a story in the edge case -- but we don't know because no real debugging efforts were reported by this blog post -- asking on Twitter does not count.

> Hm...how do you propose kubernetes / kubernetes users solve these kinds of problems? It could be a fairly common error that’s hard to catch on a system with a large number of nodes where you’re not supposed to actively think about the fact that you have nodes. What’s the right tooling / monitoring to have on a system of 20 nodes where one node is basically screwed?

The problem is thinking you can absolve yourself of all system administration tasks because you use kube or some other container-based system.

Ideally you would have built health monitoring into your original application and you would have spotted the issue in your own dashboards long before you exhausted node-level resources.

I will admit I may have been harsh, and am still being harsh, but that is because this at face value looks like an attempt to promote a product via a blog post. Many companies push or even require this sort of thing. As far as quality goes, blog posts like this that reach the front page of HN are normally chock-full of very useful information and dive deep into the why and how. If I were a blog reviewer I would say this blog left me dry and wanting more details, and the content did not warrant an entire blog post to be trumpeted around. It simply described the mundane day-to-day work that a junior developer is expected, even required, to do. If anything I am sad that the developer had a golden chance to learn some really neat stuff about Linux and Twitter was his chosen resource :(


alberteinstein 2983 days ago

Using netstat/lsof/tcpdump from inside the containers did not help, unfortunately. The eventual next step was to check the nodes, and the kernel logs revealed the issue right away.


BenjiWiebe 2983 days ago

He says it was a user space program not closing sockets.


kronin 2983 days ago

Seems that metrics providing visibility into the "network connectivity was flaky" symptom, like response times (particularly the 95th/99th percentile), plus digging into the pod, which gives you the node, would have isolated the problem pretty quickly to a single node. If a problem is isolated to a node, the first thing to look at would be the node logs. Would that pattern not have worked in this case?
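A small sketch of that pattern, assuming you already have response-time samples tagged with the node that served them. The sample format, the nearest-rank percentile, and the "3x the median node" cutoff are all assumptions chosen for illustration, not anything from the article.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_node(samples):
    """Group (node, latency_ms) samples and return each node's p99."""
    by_node = defaultdict(list)
    for node, latency_ms in samples:
        by_node[node].append(latency_ms)
    return {node: percentile(vals, 99) for node, vals in by_node.items()}


def outlier_nodes(samples, factor=3.0):
    """Nodes whose p99 exceeds `factor` times the median node's p99."""
    p99 = p99_by_node(samples)
    baseline = percentile(list(p99.values()), 50)
    return sorted(node for node, value in p99.items() if value > factor * baseline)
```

With per-node p99 in hand, a single node that stands far above its peers is the one whose node and kernel logs deserve a look first.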


Thaxll 2983 days ago

Checking the logs should be on everyone's mind when dealing with issues.


alberteinstein 2983 days ago

Indeed! And when the logs show nothing being wrong? All k8s components were reporting that everything is fine.


Thaxll 2983 days ago

Because not all problems are userspace related, it's very common on Linux to check the kernel logs, especially when you know how Kubernetes deals with networking (using iptables, etc.).
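A hedged sketch of what checking the kernel logs could look like when automated: filter kernel log text for a few networking-related warnings. The pattern list is an assumption based on messages the kernel is known to print (TCP memory exhaustion, orphaned sockets, a full conntrack table). Exact wording can vary by kernel version, so treat it as a starting point.

```python
import re

# Assumed patterns; adjust to the messages your kernel version emits.
SUSPECT_PATTERNS = (
    r"TCP: out of memory",
    r"too many orphaned sockets",
    r"nf_conntrack: table full",
)
_SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def suspicious_kernel_lines(log_text):
    """Return the kernel log lines that match any suspect pattern."""
    return [line for line in log_text.splitlines() if _SUSPECT_RE.search(line)]
```

The input would typically be the output of `dmesg` or `journalctl -k` collected on the node itself, since these messages do not show up in the Kubernetes component logs.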
