Hacker News
by bhuga 675 days ago
I work on this at Stripe. There's a lot of reasons:

* Local dev has laptop-based state that is hard to keep in sync for everyone. Broken laptops are _really hard_ to debug as opposed to cloud servers I can deploy dev management software to. I can safely say the oldest version of software that's in my cloud; the laptops skew across literally years of versions of dev tools despite a talented corpeng team managing them.

* Our cloud servers have a lot more horsepower than a laptop, which is important if a dev's current task involves multiple services.

* With a server, I can get detailed telemetry on how devs work and what they actually wait on, which helps me understand what to work on next; I would need pretty invasive spyware on laptops to do the same.

* Servers in our QA environment can interact with QA services in a way that is hard for a laptop to do. Some of these are "real services", others are incredibly important to dev itself, such as bazel caches.

There's other things; this is an abbreviated list.

If a linux VM works for you, keep working! But we have not been able to scale a thousands-of-devs experience on laptops.

I want to double check we’re talking about the same thing here. I’m referring to running everything inside a single VM that you would have total access to. It could have telemetry, you’d know versions etc. I wonder if there’s some confusion around what I’m suggesting given your points above.

I’m sure there are a bunch of things that make it the right choice for Stripe. Obviously if you just have too many things to run at a time and a dev laptop can’t handle it then it’s a dealbreaker. What’s the size of the cloud instances you have to run on?

> I’m referring to running everything inside a single VM that you would have total access to. It could have telemetry, you’d know versions etc. I wonder if there’s some confusion around what I’m suggesting given your points above.

I don't think there's confusion. I only have total access when the VM is provisioned, but I need to update the dev machine constantly.

Part of what makes a VM work well is that you can make changes and they're sticky. Folks will edit stuff in /etc, add dotfiles, add little cron jobs, build weird little SSH tunnels, whatever. You say "I can know versions", but with a VM, I can't! Devs will update stuff locally.

As the person who "deploys" the VM, I'm left in a weird spot after you've made those changes. If I want to update everyone's VM, I blow away your changes (and potentially even the branches you're working on!). I can't update anything on it without destroying it.

In contrast, the dev servers update constantly. There's a dozen moving parts on them and most of them deploy several times a day without downtime. There's a maximum host lifetime and well-documented hooks for how to customize a server when it's created, so it's clear how devs need to work with them for their customizations and what the expectations are.
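
The "maximum host lifetime plus creation hooks" arrangement could be sketched roughly as below. This is only an illustration under stated assumptions, not how the dev servers described here are actually built: `MAX_HOST_LIFETIME`, `is_expired`, and `provision` are hypothetical names, and the 30-day value is invented for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List

# Assumed value for illustration; the thread does not state the real lifetime.
MAX_HOST_LIFETIME = timedelta(days=30)

Hook = Callable[[Dict[str, str]], None]


def is_expired(created_at: datetime, now: datetime,
               max_lifetime: timedelta = MAX_HOST_LIFETIME) -> bool:
    """Return True once a host has reached its maximum lifetime."""
    return now - created_at >= max_lifetime


def provision(base_config: Dict[str, str], hooks: List[Hook]) -> Dict[str, str]:
    """Build a fresh host config, then apply each dev-supplied hook in order.

    Customizations live in hooks rather than in hand edits on the host,
    so they survive when the host is replaced.
    """
    config = dict(base_config)  # copy so the shared base is never mutated
    for hook in hooks:
        hook(config)
    return config


def prefer_vim(config: Dict[str, str]) -> None:
    """Example personal customization applied at creation time."""
    config["EDITOR"] = "vim"


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(is_expired(created, created + timedelta(days=31)))
    print(provision({"EDITOR": "nano", "SHELL": "/bin/bash"}, [prefer_vim]))
```

Because every customization is replayed at creation, recycling a host is routine rather than destructive, which is the expectation the comment describes.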

I guess it's possible you could have a policy about when the dev VM is reset and get developers used to it? But I think that would be taking away a lot of the good parts of a VM when looking at the tradeoffs.

> What’s the size of the cloud instances you have to run on?

We have a range of options devs can choose, but I don't think any of them are smaller than a high-end laptop.

So the devs don’t have the ability to ssh to your cloud instances and change config? Other than the size issue, I’m still not seeing the difference. Take your point on it needing to start before you have control, but other than that a VM on a dev machine is functionally the same as one in a cloud environment.

In terms of needing to reset, it’s just a matter of git branch, push, reset, merge. In your world that sync complexity happens all the time, in mine just on reset.

Just to be clear, I think it’s interesting to have a healthy discussion about this to see where the tradeoffs are. Feels like the sort of thing where people try to emulate you and buy themselves a bunch of complexity where other options are reasonable.

I have no doubt Stripe does what makes sense for Stripe. I’d also wager that on balance it’s not the best option for most other teams.

PS thanks for chiming in. I appreciate the extra insights and context.

> So the devs don’t have the ability to ssh to your cloud instances and change config?

They do, but I can see those changes if I'm helping debug, and more importantly, we can set up the most important parts of the dev processes as services that we can update. We can't ssh into a VM on your laptop to do that.

For example, if you start a service on a stripe machine, you're sending an RPC to a dev-runner program that allocates as many ports as are necessary, updates a local envoy to make it routable, sets up a systemd unit to keep it running, and so forth. If I need to update that component, I just deploy it like anything else. If someone configures their host until that dev runner breaks, it fails a healthcheck and that's obvious to me in a support role.
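
A rough sketch of the kind of work such a dev-runner does might look like the following. It is not the actual program described above: `start_service`, `allocate_ports`, `render_unit`, and `port_is_listening` are hypothetical names, the unit file is only rendered as text rather than installed, and the envoy routing step is left out entirely.

```python
import socket
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceAllocation:
    name: str
    ports: List[int]
    unit_text: str


def allocate_ports(count: int) -> List[int]:
    """Ask the OS for `count` distinct free TCP ports by binding to port 0.

    All sockets stay open until every port is chosen, so the ports are
    distinct. Another process could still grab one after we close it.
    """
    sockets = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("127.0.0.1", 0))
            sockets.append(s)
        return [s.getsockname()[1] for s in sockets]
    finally:
        for s in sockets:
            s.close()


def render_unit(name: str, command: str, ports: List[int]) -> str:
    """Render a minimal systemd unit that keeps the service running."""
    env = " ".join(f"PORT_{i}={p}" for i, p in enumerate(ports))
    return (
        "[Unit]\n"
        f"Description=dev service {name}\n\n"
        "[Service]\n"
        f"Environment={env}\n"
        f"ExecStart={command}\n"
        "Restart=always\n"
    )


def start_service(name: str, command: str, port_count: int) -> ServiceAllocation:
    """Allocate ports and produce the unit text for one dev service."""
    ports = allocate_ports(port_count)
    return ServiceAllocation(name, ports, render_unit(name, command, ports))


def port_is_listening(port: int, timeout: float = 0.5) -> bool:
    """Very small healthcheck: can we open a TCP connection to the port?"""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout):
            return True
    except OSError:
        return False
```

The point of putting this behind one centrally deployed component is that port allocation, supervision, and health reporting can all be updated without touching each developer's environment.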

> Just to be clear, I think it’s interesting to have a healthy discussion about this to see where the tradeoffs are. Feels like the sort of thing where people try to emulate you and buy themselves a bunch of complexity where other options are reasonable.

100% Agree! I think we've got something pretty cool, but this stuff is coming from a well-resourced team; keeping the infra for it all running is larger than many startups. There's tradeoffs involved: cost, user support, flexibility on the dev side (i.e. it's harder to add something to our servers than to test out a new kind of database on your local VM) come immediately to mind, but there are others.

There are startups doing lighter-weight, legacy-free versions of what we're doing that are worth exploring for organizations of any size. But remote dev isn't the right call for every company!

Ah! So that’s a spot where we’re talking past each other.

I’d anticipate you would be equally as able to ssh to VMs on dev laptops. That’s definitely a prerequisite for making this work in the same way as you’re currently doing.

The only difference between what you do and what I’m suggesting is the location of the VM. That itself creates some tradeoffs but I would expect absolutely everything inside the machine to be the same.

> I’d anticipate you would be equally as able to ssh to VMs on dev laptops. That’s definitely a prerequisite for making this work in the same way as you’re currently doing.

Our laptops don't receive connections, but even if they could, folks go on leave and turn them off for 9 months at a time, or they don't get updated for whatever reason, or other nutty stuff.

With a few thousand laptops out there, this is surprisingly common: the laptop management code that removes old versions of a tool is itself removed after months, but laptops still pop up with the old version as folks turn them back on after a very long time, and the old tool lingers. The services the tools interact with have long since stopped working with the old version, and the laptop behaves in unpredictable ways.

This doesn't just apply to hypothetical VMs, but also to the various CLI tools that we deploy to laptops, and we still have trouble there. The VMs are just one example, but a guiding principle for us has been that the less that's on the laptop, the more control we have, and thus the better we can support users with issues.
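
One common way to make stale tools fail loudly instead of unpredictably is for the service to reject clients below a minimum version. The thread does not say whether these services do this; the sketch below is only an illustration, with `parse_version` and `check_client` as hypothetical names and plain dotted numeric versions assumed.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def check_client(client_version: str, minimum_supported: str) -> str:
    """Return 'ok', or an explicit upgrade message for clients that are too old."""
    if parse_version(client_version) < parse_version(minimum_supported):
        return (f"client {client_version} is older than "
                f"{minimum_supported}; please upgrade")
    return "ok"


if __name__ == "__main__":
    # Comparing the raw strings would get this wrong: "1.9.0" > "1.10.0".
    print(check_client("1.9.0", "1.10.0"))
    print(check_client("1.10.0", "1.10.0"))
```

This only helps with the symptom. A laptop that has been off for nine months still has to be able to fetch the upgrade, which is the management problem described above.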

I see in another comment thread you mentioned downloading the VM iso, presumably from a central source. Your comment in this thread didn't mention that so perhaps this answer (incorrectly) assumes the VM you are talking about was locally maintained/created?

To provide historical context, 10 years ago there was a local dev infrastructure, but it was already so creaky as to be unreliable. Just getting the ruby dependencies updated was a problem. The local dev was also already cheating: All the asynchronous work that was triggered via RabbitMQ/Kafka was getting hacked together, because trying to run everything that Infra/Queues did locally would have been very wasteful. So magic occurred in the calls to the message queue that instead triggered the crucial ruby code that would be hit in the end.

So if this was a problem back then, when the company had fewer than 1000 employees, I can't even imagine how hard it would be to get local dev working now.
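
The "cheat" described above, where publishing to the queue directly runs the code that would eventually consume the message, can be sketched as an in-process stand-in for the broker. The original was Ruby code and its details are not given here, so this Python version is only an illustration; `InlineQueue`, `subscribe`, and `publish` are hypothetical names.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InlineQueue:
    """Local stand-in for a message broker.

    Instead of sending a message to RabbitMQ/Kafka and waiting for a
    consumer, publish() calls the registered handlers synchronously in
    the same process.
    """

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # No broker, no retries, no ordering guarantees: just a direct call.
        for handler in self._handlers.get(topic, []):
            handler(message)


if __name__ == "__main__":
    processed: List[dict] = []
    queue = InlineQueue()
    queue.subscribe("charge.created", processed.append)
    queue.publish("charge.created", {"id": "ch_123"})
    print(processed)
```

The shortcut keeps local dev cheap, but it also hides everything the real queue does (delays, retries, redelivery), which is part of why local and production behavior drift apart.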

Sounds like you made a massive tradeoff in code coupling if you can't easily swap out remote for local queues etc. But I get it: when you're thinking cloud-first, understanding where your abstractions start and end can be a complex topic that creates flow-on effects and often stops the whizz-bang cloud demo code from working in your solution via copy/paste. Depending on the stage of your company, this could be a feature or a bug. Maybe you have so much complexity in your solution from spreading business logic across services that it only makes sense when you're developing against prod-like infra. In that scenario I can see the benefit of cloud-first dev infra, because keeping that beast tamed otherwise would be a monumental challenge given the penchant for cloud-first to be auto-update-everything.

The way these problems are stated might make it seem like they're unsolvable without a lot of effort. I just want to point out that I've worked at places that do use a local, supported environment, and it works well.

Not saying it's the wrong choice for you, but it's a choice, not a natural conclusion.