by paranoidrobot 2093 days ago
That should be the exception, rather than the rule.

And it is deeply frustrating when you run up against one of these exceptions and need to wade through some bureaucracy before you can investigate further.
It would seem to me that this is the perfect time to pull in someone with more production experience. Perhaps they can use the existing tools to pull logs, or analyse it in some way. Maybe they've seen it before and already know the fix.

Giving everyone production SSH access is, in my experience, a way to run into all sorts of weirdness, not to mention endless frustration.

In a modern automated infrastructure, that box is likely a container running on an ephemeral virtual machine that can (and probably will) go away at any moment for any number of reasons - maybe CD kicked off a new deployment, maybe the load changed and the instance was selected for scale-down, maybe our spot bid for that AZ isn't sufficient to keep the instance around, or maybe you being SSHed in and poking around affected the health check, so it's being killed for not performing right.

There are many other problems, too - lots of applications are built in such a way that there's simply no option other than secrets (passwords, API tokens, keys) to reach other systems, particularly third-party systems. So production boxes have production secrets, which you probably don't want to share with everyone.

Giving everyone SSH access so they can, in theory, take nginx/kernel dumps as needed tends to imply giving superuser rights, which means they can do whatever they like.

So, yes, pull in someone else. Find some way to try to reproduce the problem NOT on production. If that fails, perhaps there's a way to grab enough detail or pull additional logs or network captures to identify the issue. If that fails, well okay, let's SSH in - but we need to coordinate that to ensure that the instance doesn't go away, and that production isn't impacted while you do it.
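To make the ordering above concrete, here's a rough sketch of that escalation ladder in Python. The step names and the `next_step` helper are made up for illustration; they aren't from any real tool, and a real process would obviously involve people rather than an enum.

```python
from enum import Enum
from typing import Set


class Step(Enum):
    # Hypothetical step names, listed in the order described above.
    PULL_IN_SOMEONE_ELSE = 1
    REPRODUCE_OFF_PRODUCTION = 2
    PULL_EXTRA_LOGS_OR_CAPTURES = 3
    COORDINATED_SSH = 4  # last resort


def next_step(exhausted: Set[Step]) -> Step:
    """Return the first step, in order, that hasn't been exhausted yet."""
    for step in Step:  # Enum iteration follows definition order
        if step not in exhausted:
            return step
    # Everything has been tried; SSH stays the final (coordinated) option.
    return Step.COORDINATED_SSH
```

The point of writing it this way is only that SSH sits at the bottom of the list: you get there by ruling out the cheaper, safer options first.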

The point people are trying to make is that if you are at the scale where a kernel bug or an nginx bug is borking your app, it's not the developer's job to go poking around the system for a fix. It's the devops/infra people's job. In my world, if you want to investigate an nginx bug... "docker run -it nginx:latest /bin/bash" and go for it... find the issue, reproduce it, then fix it in the pipeline and deploy again. You didn't touch production at all. If your debugging relies on being ON PRODUCTION, you don't suffer from the scale you need to be on there in the first place.
Not all bugs are sufficiently cheaply reproducible outside the environment in which they are observed. It seems silly to tie your hands behind your back when you could just inspect what the computer is doing and then fix it.
Why can't your developers be "devops people"?
I think developers can be devops also - but those are different skills you need to learn and keep up. Someone good at, say, Node.js or Python data science may not be the best at CUDA build compilation on CentOS. And being good at both makes you less good at each unless you're working 18hrs a day to keep up with everything.

There is also the case of ratios. An organization probably needs more developers in specific areas than DevOps, so with dedicated DevOps you could concentrate similar work from across several teams to a dedicated DevOps team that knows that work very well.

I've already written a response to this elsewhere in the thread, but developers are not all equal.

You can't hire twenty developers that all have the same skill/inclinations, the same interests, the same experience.

That's not to say that a DevOps Engineer is some super 10x rockstar developer - no, they're going to have the same variations on skill, interests, experience, etc.

It depends on your environment, but there's so much different tech once you count the entire stack, that I don't think it's reasonable to expect any one person to be an expert on all of it, or even a lot of it.

Sure, but "not all developers can be devops engineers" doesn't necessarily imply "none of your developers should have server access".
Let's go back to the original core assertion for the thread -

> > Every developer needs access to some servers for example to check the application logs.

> I fundamentally disagree with this.

So,

Developers shouldn't be reaching for SSH access to check logs.

If you're encountering problems that you can't diagnose through the existing logs, then you should probably be involving at least one other person - someone who has that production experience, who might have some additional knowledge about the problem.

If, and only if, you've exhausted other avenues - then reach out for SSH access. But it should be a last resort, not the first resort. Plus, anyone SSHing into production boxes should really be very familiar with how production is configured. You can do more harm than good by poking around on a production box, completely unaware that you're causing alarms and outages elsewhere because your memory dump of nginx caused in-flight requests to time out, and so forth. The people with that experience are generally the DevOps/Infrastructure folks, since they're the ones who deal with production all day and are going to get the pages if something goes wrong with that.