by randomdata 652 days ago
To start, you need to figure out why Kubernetes isn't retaining your stack trace and related metadata when the app crashes. That is the most pressing bug, and it is probably best left to the k8s team. You outsourced that aspect of the business for good reason, no doubt.

After they've fixed what they need to fix, you need to use the information now being retained to narrow down why your app is crashing at all. Failing to open a file is expected behaviour. It should not be crashing.

Then maybe you can get around to looking at the close issue. But it's the least of your concerns. You've got way bigger problems to tackle first.


The app crashes because "too many files" includes the fd accept(2) wants to allocate so your app can respond to the health check.
A file failing to open is expected, always! accept is no exception here. Your application should not be crashing because of it.

If I recall, Kubernetes performs health checks over HTTP, so presumably your application is using the standard library's http server to provide that? If so, accept is fully abstracted away. So, if that's crashing, that's a bug in Go.

Is that for you to debug, or is it best passed on to the Go team?

There isn't a bug, it's resource exhaustion. You open a bunch of files and they fail to close. You don't log errors on the close, so you have no idea it's happening. Now your app is failing to open new file descriptors to accept HTTP connections. You get a fixed number of fds per app; ulimit -n. If you don't close files you've read, the descriptor is gone.

The bug in this case is in the filesystem that hangs on close. It happens on network filesystems. You can't return the fd to the kernel if your filesystem doesn't let you.

The bug of which we speak is that your app is crashing. Exhausting open file handles is expected behaviour! Expected behaviour should not lead to a crash. Crashing is only for exceptional behaviour.

The filesystem hanging is unlikely to be a bug. The filesystems you'd realistically use in conjunction with Kubernetes are pretty heavily tested. More likely it is supposed to hang under whatever conditions have led to that.

And, sure, maybe you'll eventually want to determine why the filesystem has moved into that failure state, but most pressing is that your app is crashing. All that work you put into gracefully handling the failing situation is going to waste.

You're really hung up on Kubernetes but it was an incidental comment in a hypothetical story.

"You wake up and find out that Heroku's staff is anxiously awaiting your departure from your apartment to tell you that your app is down."

Kubernetes is really neither here nor there. It's the crashing of the app that is our focus. An app should not be crashing on expected behaviour.

That's clearly a bug, and the bug you need to fix first so that you can have your failsafes start working again. You asked where to start and that's the answer, unquestionably.