Hacker News
by WYepQ4dNnG 1541 days ago
I don't see how this can scale beyond a single service.

Complex systems are made of several interconnected services and pieces of infrastructure, things that are impossible to run locally. And even if you can run them locally, the setup is most likely very different from production. The fact that things work locally gives little to no guarantee that they will work in prod.

If you have a fully automated infrastructure setup (e.g., Terraform and friends), then it is not that hard to maintain a staging environment that is identical to production.

Create a new feature branch from main and run unit and integration tests. If they pass, changes are automatically merged into the main branch.

From there, a release is cut and deployed to staging. Run tests in staging; if all is good, promote the release to production.
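
A minimal sketch of that flow, assuming each test stage can be modelled as a function that returns True on success. The `Pipeline` class, the stage names, and the release naming are hypothetical; a real setup would drive this from a CI system rather than a Python object.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical: a check takes a branch or release name and returns True if it passed.
# In practice these would invoke your test runner; here they are plain callables.
Check = Callable[[str], bool]


@dataclass
class Pipeline:
    unit_tests: Check
    integration_tests: Check
    staging_tests: Check
    log: list[str] = field(default_factory=list)

    def run(self, branch: str) -> str:
        """Push one feature branch through the flow and return where it stops."""
        # Tests on the feature branch gate the automatic merge into main.
        if not (self.unit_tests(branch) and self.integration_tests(branch)):
            self.log.append(f"{branch}: tests failed, not merged")
            return "feature-branch"
        self.log.append(f"{branch}: merged into main")

        # Cut a release from main and deploy it to staging.
        release = f"release-of-{branch}"
        self.log.append(f"{release}: deployed to staging")

        # Promote the same release to production only if staging tests pass.
        if not self.staging_tests(release):
            self.log.append(f"{release}: staging tests failed, not promoted")
            return "staging"
        self.log.append(f"{release}: promoted to production")
        return "production"
```

The key property is that production only ever receives a release that already passed in staging; nothing is rebuilt between the two environments.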

> Complex systems are made of several services and infrastructure all interconnected.

Then maybe it's a forcing function to decouple that tangle of code. That's a good thing!

The problem with staging environments is that replicating the functionality is easy, but replicating the data, interactions, and behavior of people in a real environment is not. It's better to think in terms of early-access releases and some kind of controlled rollout of new software, so you catch bugs and issues before they impact most of your users.
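
One common way to implement a controlled rollout is deterministic percentage bucketing: hash each user into one of 100 buckets and enable the feature for buckets below the rollout percentage. This is a sketch of that general technique, not something the comment prescribes; `rollout_bucket`, `is_enabled`, and the choice of SHA-256 are assumptions.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a (feature, user) pair to a stable bucket in 0..99."""
    # Including the feature name means different features get different cohorts.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` buckets."""
    return rollout_bucket(user_id, feature) < percent
```

Because a user's bucket never changes, raising the percentage from 10 to 50 only adds users; nobody who already had the feature loses it midway through the rollout.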

I've seen many projects where the staging environment is a bad joke and most real testing happens in production anyway. These days, the alternative is to be more clever about how you roll out software to your production environments. There are various ways of doing this, but it always boils down to having both the old and the new software running in the same environment and controlling who gets to see what using feature flags, DNS, routing, etc. If you run any kind of A/B tests, this is also what you would need. I've seen some companies do this, but of course it is mostly more aspirational than actual.
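
A minimal sketch of the routing idea, assuming the old and new versions are deployed side by side behind two internal addresses and that early-access membership is a simple set of user IDs. The upstream URLs and the `choose_upstream` name are hypothetical; in practice this decision usually lives in a load balancer, proxy, or feature-flag service.

```python
# Hypothetical addresses for the two versions running in the same environment.
UPSTREAMS = {
    "stable": "http://app-v1.internal",
    "canary": "http://app-v2.internal",
}


def choose_upstream(user_id: str, early_access: frozenset[str]) -> str:
    """Send early-access users to the new version and everyone else to the old one."""
    version = "canary" if user_id in early_access else "stable"
    return UPSTREAMS[version]
```

Swapping the set-membership test for a percentage bucket or an experiment assignment turns the same router into an A/B test split.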

For the SaaS company where I'm CTO, I actually stumbled on a nice mechanism when I realized that our customers' desire for dedicated setups led us to a natural state where we update those last, which makes our multi-tenant environment a natural place to test and provide early access. Likewise, our web app rolls out immediately from our master branch, but we package it for Android/iOS less frequently because of the release bureaucracy Apple and Google impose. So that branch is effectively our stable release. We also have a matching web server for that branch that updates only when we merge to our production branch. The other server uses the same infrastructure (database, Redis, etc.) but updates straight from our master branch. So our staging server is part of our production environment: it serves the same data, is exposed to the same user behavior, etc.
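
The setup described above can be modelled as a small deployment inventory: which branch each target tracks, which infrastructure it talks to, and whether it is a dedicated customer setup. This is a hypothetical sketch; the target names, the `infra` labels, and the assumption that dedicated setups deploy from the production branch after the shared servers are illustrative, not taken from the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    branch: str      # branch this target deploys from
    infra: str       # database/Redis cluster the target talks to
    dedicated: bool  # customer-dedicated setups are updated last


# Hypothetical inventory: both web servers share production infrastructure,
# so the one tracking master acts as a staging server inside production.
TARGETS = [
    Target("dedicated-acme", "production", "acme-dedicated", dedicated=True),
    Target("web-stable", "production", "shared-prod", dedicated=False),
    Target("web-latest", "master", "shared-prod", dedicated=False),
]


def targets_for_push(branch: str) -> list[str]:
    """Targets to redeploy when `branch` changes, multi-tenant before dedicated."""
    hits = [t for t in TARGETS if t.branch == branch]
    # sorted() is stable and False sorts before True, so shared targets go first.
    return [t.name for t in sorted(hits, key=lambda t: t.dedicated)]
```

A push to master touches only `web-latest`; a merge to production updates the shared stable server first and the dedicated setup afterwards.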

That also makes it easier to verify that old and new client software works with both our latest server and the production servers for our dedicated setups.
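
Verifying that "old and new clients work with both servers" amounts to testing a client × server matrix. A minimal sketch, assuming a caller-supplied `works(client, server)` check; the version labels below are hypothetical placeholders for real build identifiers.

```python
from itertools import product
from typing import Callable

# Hypothetical build labels: previous mobile releases stay installed for a while.
CLIENTS = ["web@master", "android@current", "android@previous", "ios@current", "ios@previous"]
SERVERS = ["server@master", "server@production"]


def failing_pairs(
    clients: list[str],
    servers: list[str],
    works: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Run `works` on every client/server pairing and return the ones that fail."""
    return [(c, s) for c, s in product(clients, servers) if not works(c, s)]
```

Enumerating the pairs explicitly makes it obvious when a new client release or server branch multiplies the number of combinations that must keep working.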

You need both. In my experience working in SaaS, enterprises expect reliable and stable platforms. A staging environment is an extra safety net that can help prevent shipping a completely broken product. In a staging environment you can turn experiments and feature flags on and off before doing so in production.

That said, you should also build the product so that you can run experiments and turn on a feature for only a small subset of production customers, usually the free tier, and then gradually roll it out to everybody else.
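
A tier-first rollout can be expressed as an ordered list of stages, each widening the set of customer tiers that see the feature. This is a sketch under stated assumptions: the tier names `pro` and `enterprise`, the `ROLLOUT_STAGES` plan, and `feature_on` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical plan: start with the free tier, then widen stage by stage.
ROLLOUT_STAGES = [
    {"free"},
    {"free", "pro"},
    {"free", "pro", "enterprise"},
]


@dataclass(frozen=True)
class Customer:
    id: str
    tier: str


def feature_on(customer: Customer, stage: int) -> bool:
    """Is the feature visible to this customer at the given rollout stage?"""
    if stage < 0:
        return False  # not rolled out to anyone yet
    # Stages past the end of the plan mean "everyone in the final stage".
    allowed = ROLLOUT_STAGES[min(stage, len(ROLLOUT_STAGES) - 1)]
    return customer.tier in allowed
```

Advancing the rollout is then a one-number change, and rolling back is just lowering the stage again.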

Lastly, the staging environment should be treated as a production-grade environment: if it breaks, there should be SREs/developers on call ready to jump in and fix it.

Note that this also necessarily requires that critical variables and configuration options be stored in version control rather than in the database. Bootstrapping staging databases with the values needed to run the application is a constant challenge.

Otherwise, your production environment would have massively different feature flags and other config than staging.
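
Keeping config in version control also makes drift between environments checkable. A minimal sketch, assuming each environment's flags live in a checked-in JSON file at `config/<env>.json` (a hypothetical layout); `load_config` and `config_drift` are illustrative names.

```python
import json
from pathlib import Path

_MISSING = object()  # marks a key that exists in only one environment


def load_config(repo_root: Path, env: str) -> dict:
    """Read the checked-in config for one environment (assumed: config/<env>.json)."""
    return json.loads((repo_root / "config" / f"{env}.json").read_text())


def config_drift(staging: dict, production: dict) -> list[str]:
    """Keys whose values differ between environments, or exist in only one."""
    keys = staging.keys() | production.keys()
    return sorted(
        k for k in keys if staging.get(k, _MISSING) != production.get(k, _MISSING)
    )
```

Run in CI, a non-empty drift list can be surfaced for review, so any difference between staging and production flags is deliberate rather than accidental.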

There are tools that can diff your databases and generate a change script. So we just diff local against staging, capture the changes, and check them in along with the code changes. Every change to the database creates a schema change record, so it's easy to apply only the latest changes if they're versioned.
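
The "apply only the latest changes" part can be illustrated with a version table. This sketch covers only applying versioned change records, not the diffing tools mentioned above; the `MIGRATIONS` list, the `schema_version` table, and the use of SQLite are assumptions for illustration.

```python
import sqlite3

# Hypothetical ordered schema change records: (version, SQL).
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN created_at TEXT"),
    (3, "CREATE INDEX idx_users_email ON users (email)"),
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply only the schema changes newer than the recorded version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:
            with conn:  # commits once the change has run and its version is recorded
                conn.execute(sql)
                conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
            current = version
    return current
```

Running it again on an up-to-date database is a no-op, which is what lets the same change records be applied to local, staging, and production in turn.
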

I think the problems they have with managing non-prod environments are actually a symptom of having many systems. Staging environments are easy to maintain when there is one system; with a complicated service-oriented architecture, non-prod environments become much more difficult and expensive to maintain.