|
It seems to me that this is the perfect time to pull in someone with more production experience. Perhaps they can use the existing tools to pull logs or analyse the problem in some other way. Maybe they've seen it before and already know the fix. Giving everyone production SSH access is, in my experience, a way to run into all sorts of weirdness, not to mention endless frustration. In a modern automated infrastructure, that box is likely a container running on an ephemeral virtual machine that can (and probably will) go away at any moment for any number of reasons: maybe CD kicked off a new deployment, maybe the load changed and the instance was selected for scale-down, maybe our spot bid for that AZ isn't sufficient to keep the instance around, or maybe your being SSHed in and poking around affected the health check, so the instance is being killed for not performing right. There are other problems, too: lots of applications are built in such a way that there's simply no way to reach other systems, particularly third-party systems, except with secrets (passwords, API tokens, keys).
So production boxes have production secrets, which you probably don't want to share with everyone. Giving everyone SSH access so they can, in theory, take nginx/kernel dumps as needed tends to imply giving superuser rights, which means they can do whatever they like. So, yes, pull in someone else and find some way to try to reproduce the problem NOT on production. If that fails, perhaps there's a way to grab enough detail, or pull additional logs or network captures, to identify the issue. If that fails too, well, okay, let's SSH in, but we need to coordinate that to ensure the instance doesn't go away and that the session doesn't impact production while you do it. |