by tyingq 1807 days ago
"Perhaps more interesting, though, is that for the last couple of years AWS has supported tunneling the SSH protocol over their SSM APIs if you use the SSM “document” called AWS-StartSSHSession."

That's interesting. I know some places go to great lengths to keep developers from accessing production without some sort of break-glass procedure through a jump host. I'm curious if they all know about this sort of loophole.

SSM is much preferred to a jump host for a number of reasons.

1. You don't have to expose a jump host at all, which is one less exposed asset to manage and worry about.

2. Your security team should already be collecting CloudTrail logs, so they get auditing of SSM/SSH "for free".

3. You can control SSM access via your SSO provider, which means you can trivially enforce a bunch of policies all in one place vs having to configure SSHD.

4. You can control SSM access via IAM.

5. You can limit session duration easily.

6. No more SSH agent hijacking, at least as far as I know.

I also wouldn't call this a loophole, you have to explicitly have permissions to use SSM.

>I also wouldn't call this a loophole, you have to explicitly have permissions to use SSM.

Perhaps not the best wording on my part. I was aware of SSM, but not aware of the SSH tunneling features. I'm wondering if that's common. Is the SSH tunneling controlled separately, or on by default if SSM is on?

It is "on" by default, but the user still has to have the 'ssm:StartSession' permission (and probably others) to open the SSM session, and for some(?) operations you also still need the appropriate credentials (an SSH key pair or a password) to log in via SSH.
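To make the permission point concrete, here is a minimal sketch of an IAM policy that allows 'ssm:StartSession' only for one instance and only through the AWS-StartSSHSession document. The function name, account ID, and instance ID are hypothetical, and a real setup may well need more actions than this (the "and probably others" above), so treat it as an illustration rather than a complete policy.

```python
import json


def ssh_over_ssm_policy(instance_arn: str) -> dict:
    """Build an IAM policy document allowing SSH-over-SSM to one instance.

    Assumption: only ssm:StartSession is granted here; a working setup
    may require additional actions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ssm:StartSession",
                "Resource": [
                    # The instance the user may connect to.
                    instance_arn,
                    # The SSM document that tunnels SSH.
                    "arn:aws:ssm:*:*:document/AWS-StartSSHSession",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical account and instance IDs, for illustration only.
    arn = "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"
    print(json.dumps(ssh_over_ssm_policy(arn), indent=2))
```

Leaving the document ARN out of the Resource list is one way to keep plain SSM sessions available while not granting the SSH tunnel.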

SSM Session Manager is one of the preferred ways (if not the preferred way) to manage SSH access to instances in AWS. It's kinda hairy to set up, but it removes the need for bastion hosts/jump boxes for most use cases. From my experience I would say it is quite common.
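For a sense of what the client side of that setup looks like, here is a rough sketch that generates an ssh_config stanza routing SSH through `aws ssm start-session` with the AWS-StartSSHSession document. The helper names and the host pattern are my own choices, and it assumes the AWS CLI and the Session Manager plugin are already installed on the client.

```python
import shlex
from typing import List


def ssm_proxy_command(target: str = "%h", port: str = "%p") -> List[str]:
    """Return the argv that tunnels SSH over SSM.

    The defaults are ssh_config tokens: ssh expands %h to the host name
    (here, the instance ID) and %p to the port.
    """
    return [
        "aws", "ssm", "start-session",
        "--target", target,
        "--document-name", "AWS-StartSSHSession",
        "--parameters", f"portNumber={port}",
    ]


def ssh_config_stanza(host_pattern: str = "i-* mi-*") -> str:
    """Render a stanza for ~/.ssh/config (the pattern is an assumption)."""
    cmd = shlex.join(ssm_proxy_command())
    return f"Host {host_pattern}\n    ProxyCommand {cmd}\n"


if __name__ == "__main__":
    print(ssh_config_stanza())
```

With a stanza like this in place, `ssh user@<instance-id>` goes through SSM rather than a bastion, and you still authenticate to sshd with your key or password as usual.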

That means installing Yet Another Agent on your cluster/VMs and ensuring it stays updated. The SSM agent got an upgrade, I believe from Python to Go, but it still does a lot more than just provide SSH sessions, correct?

I don't really know much about the agent. I'm not super concerned with keeping it updated though.

Also, forgetting to quote ~ commands when going through a jump host leads to unexpected behavior, usually disconnection!

We've been using symops[0], which uses the AWS-StartSSHSession document. What's nice is that it lets you set up different workflows for how people access servers. Plus you get all the advantages of SSM in general (IAM/SSO, CloudTrail, etc.).

[0] https://symops.com/

As the article states, it's completely controlled by IAM and whatever federated identity management you hook up to AWS, and the events are auditable via CloudTrail etc.

Accessing production on the command line is an anti-pattern. I can only think of one good reason to do it: if one is investigating a security incident where a hacker has broken into production and screwed around. Even then, one would want to snapshot the instance and take it offline to investigate it.

If there's some tricky bug in production, then one can create some sort of debugging service that runs on another port and deploy it to investigate the bug, or use management and monitoring tools. Copying files up to production is something that should be only done by an automated deployment script.

> If there's some tricky bug in production, then one can create some sort of debugging service that runs on another port and deploy it to investigate the bug,

If you are under time pressure to fix an escalation from a high-profile customer, and you don't have such a service yet, do you make the customer wait for you to write one, or do you just use command line access? Or else, if you already have such a service, but it doesn't contain the necessary diagnostics to investigate this particular problem, do you make the customer wait for you to enhance it, or do you just use command line access? Or do you make your debug service totally generic (allow it to run arbitrary code supplied by the user), in which case it can do anything the command line can? But then how is that actually any more secure than more standard means of command line access? Plus, it is going to add friction, which may slow down resolution.

> or use management and monitoring tools.

Often these work fine for some problems, and then you get a problem which they don't cover adequately, and you need to go beyond them.

>Accessing production on the command line is an anti-pattern

Seems to be at odds with

>then one can create some sort of debugging service that runs on another port and deploy it to investigate the bug

In many cases, that's just SSH. In most cases, I'm not copying files around; I want to connect to the real environment where firewall rules, API keys, permission systems, overlay networks, etc. are in place. If there's a stuck process (let's say, lock contention), it's much easier to just SSH on, run gdb, and check the stack to see what it's doing. Some languages like Java have pretty rich tooling out of the box for remotely connecting to processes. For others, like Python and Ruby, you just use gdb.

Either way, there's no copying data necessary--you just need access to the running process. For a large system with hundreds of identical servers, I don't want to deploy a debug service everywhere; I just want to connect to the one with an issue and check that.

Snapshotting works sometimes, but I used stuck processes as an example since that's usually where all this remote/log/etc stuff falls apart. And, as it so happens, things like lock contention tend to be really hard to recreate in synthetic or simulated environments that don't have real, authentic load.

Keep in mind that doesn't mean "go crazy with `root` in production". You can combine that strategy with scripting and tooling to drain/isolate/quarantine servers where the stuck process is still running but they don't have live traffic being routed to them.

I see this "ZOMG NO ONE TOUCH PROD" mentality a lot in highly regulated environments, but it's usually more sustainable to isolate the in-scope systems' functionality as narrowly as possible, so you avoid bringing an unnecessarily large number of things into scope (e.g. put the billing functionality in a microservice to limit PCI scope).

That's the way things should work and ought to be done.

But what about when things don't work like they should and ought to?

Want to debug network connectivity issues? See which process is hogging CPU? Investigate installation/deploy problems? Reinvent the wheel, or use what's already there.