Hacker News new | ask | show | jobs
by cyberpunk 1847 days ago
Did we? I mean in prod I still frequently pop a shell in a pod, apk install some tool and reproduce issues outside of the app, sure I could trawl through gbs of istio logs or whatever but it would probably at least double incident resolution if I had no userland available on the machine with the problem…
1 comments

My understanding of the best practices is that shelling into prod is a breakglass only. Developers need approval to get escalated permissions to shell into prod in the first place. Further, containers shouldn’t run as root (security) so I don’t know how you would install software anyway. Logs and metrics should similarly be queryable via some central log explorer service like CloudWatch, Splunk, Prometheus, or even kubectl+grep. You shouldn’t have to manually page through GBs of logs.

Our images are often pretty stripped down (coreutils at most, often just a Go binary and some certs), so there aren’t many debugging tools available.

This might make our time to resolution slightly higher, but it keeps our incident count quite a lot lower because we very rarely need to break glass in the first place (this means you have to establish norms for logging, instrumentation, and tests).

Curl to /dev/shm/ ;)