by jrockway 650 days ago
> If close fails after read, who gives a shit?

ulimit -n

You ignore errors on close, and one morning you wake up with your app in CrashLoopBackOff with the final log message "too many files". How do you start debugging this?

Compare the process to the case where you do log errors, and your log is full of "close /mnt/some-terrible-fuse-filesystem/scratch.txt: input/output error". Still baffling of course, but you have some idea where to go next.


To start, you need to figure out why Kubernetes isn't retaining your stack trace and related metadata when the app crashes. That is the most pressing bug, and it is probably best left to the k8s team. You outsourced that aspect of the business for good reason, no doubt.

After they've fixed what they need to fix, you need to use the information now being retained to narrow down why your app is crashing at all. Failing to open a file is expected behaviour. It should not be crashing.

Then maybe you can get around to looking at the close issue. But it's the least of your concerns. You've got way bigger problems to tackle first.

The app crashes because "too many files" includes the fd accept(2) wants to allocate so your app can respond to the health check.
A file that can't be opened is expected, always! accept is no exception here. Your application should not be crashing because of it.

If I recall, Kubernetes performs health checks over HTTP, so presumably your application is using the standard library's http server to provide that? If so, accept is fully abstracted away. So, if that's crashing, that's a bug in Go.

Is that for you to debug, or is it best passed on to the Go team?

There isn't a bug, it's resource exhaustion. You open a bunch of files and they fail to close. You don't log errors on the close, so you have no idea it's happening. Now your app is failing to open new file descriptors to accept HTTP connections. You get a fixed number of fds per app; ulimit -n. If you don't close files you've read, the descriptor is gone.
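
To make the fixed budget concrete, here is a small Go sketch that compares the number of open descriptors with the soft limit that `ulimit -n` reports. It assumes Linux (it counts entries in /proc/self/fd), and `fdUsage` is a made-up name.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// fdUsage (hypothetical helper) returns how many descriptors this process
// has open and its soft RLIMIT_NOFILE. Linux only: it lists /proc/self/fd,
// so the count includes the descriptor used for the listing itself.
func fdUsage() (open int, limit uint64, err error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, 0, err
	}
	return len(entries), rl.Cur, nil
}

func main() {
	open, limit, err := fdUsage()
	if err != nil {
		fmt.Fprintf(os.Stderr, "fd usage unavailable: %v\n", err)
		return
	}
	fmt.Printf("open fds: %d of %d\n", open, limit)
}
```

Logging this figure periodically would show the descriptor count creeping toward the limit well before accept(2) starts failing.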

The bug in this case is in the filesystem that hangs on close. It happens on network filesystems. You can't return the fd to the kernel if your filesystem doesn't let you.

The bug of which we speak is that your app is crashing. Exhausting open file handles is expected behaviour! Expected behaviour should not lead to a crash. Crashing is only for exceptional behaviour.

The filesystem hanging is unlikely to be a bug. The filesystems you'd realistically use in conjunction with Kubernetes are pretty heavily tested. More likely it is supposed to hang under whatever conditions have led to that.

And, sure, maybe you'll eventually want to determine why the filesystem has moved into that failure state, but most pressing is that your app is crashing. All that work you put into gracefully handling the failure is going to waste.

You're really hung up on Kubernetes but it was an incidental comment in a hypothetical story.

"You wake up and find out that Heroku's staff is anxiously awaiting your departure from your apartment to tell you that your app is down."