by monster_group 1887 days ago
I don't keep track. All microservices use continuous deployment pipelines. If you check in code and it passes all the tests, it will make it out to prod some time in the next few hours.

How do updates to a database work through that pipeline? Are migrations run and rolled back automatically as needed?
Not in the context of micro-services, but we ran our production DB for years in a “both N and N+1 work” mode by following a few simple rules which turn out to be not that restrictive in practice.

Short version: have DB1 hold the transactional data (data generated while running the system). Have DB2a hold the release-bound data (data about and connected to the code itself: settings, prices, whatever).

Have DB2a have views onto DB1 tables. Version a code only “knows about” DB2a but any transactional CRUD ops hit the tables on DB1.

Now version b of the code just needs to ship/create a DB2b and both a and b can run in parallel.

If you need to change the shape of DB1 tables, those changes need to be backward compatible (can only add nullable columns, no use of "select *", etc).

There are a few details about how to make it fully practical, but that’s the gist, and we ran that for about 12 years on a moderately heavily trafficked e-commerce site.
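
A minimal sketch of the DB1/DB2a/DB2b layout described above, using a single in-memory SQLite database with name prefixes standing in for the separate databases. The table names, columns, and prices are all hypothetical, and writes go straight to the DB1 table because plain SQLite views are read-only; the original setup may well have differed on every one of these points.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "DB1": long-lived transactional table shared by every release (hypothetical schema).
conn.execute(
    "CREATE TABLE db1_orders (id INTEGER PRIMARY KEY, sku TEXT NOT NULL, qty INTEGER NOT NULL)"
)

# "DB2a": release-bound data for version a, plus a view onto the DB1 table.
conn.execute("CREATE TABLE db2a_prices (sku TEXT PRIMARY KEY, cents INTEGER NOT NULL)")
conn.execute("INSERT INTO db2a_prices VALUES ('widget', 500)")
conn.execute("CREATE VIEW db2a_orders AS SELECT id, sku, qty FROM db1_orders")

# Version b ships: DB1 only gains a nullable column (backward compatible),
# and version b brings its own "DB2b" with new prices and a wider view.
conn.execute("ALTER TABLE db1_orders ADD COLUMN coupon TEXT")
conn.execute("CREATE TABLE db2b_prices (sku TEXT PRIMARY KEY, cents INTEGER NOT NULL)")
conn.execute("INSERT INTO db2b_prices VALUES ('widget', 450)")
conn.execute("CREATE VIEW db2b_orders AS SELECT id, sku, qty, coupon FROM db1_orders")

# Both versions keep writing transactional rows while running side by side.
conn.execute("INSERT INTO db1_orders (sku, qty) VALUES ('widget', 2)")  # written by version a
conn.execute(
    "INSERT INTO db1_orders (sku, qty, coupon) VALUES ('widget', 1, 'SPRING')"
)  # written by version b


def order_totals(release):
    """Price every order using only the objects that one release knows about."""
    if release not in ("a", "b"):
        raise ValueError("unknown release")
    query = (
        f"SELECT o.id, o.qty * p.cents FROM db2{release}_orders o "
        f"JOIN db2{release}_prices p ON p.sku = o.sku ORDER BY o.id"
    )
    return conn.execute(query).fetchall()


print(order_totals("a"))
print(order_totals("b"))
```

Version a's view names its columns explicitly, so the extra `coupon` column never reaches version a's code. That is the same reason the rules forbid "select *".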

Database versioning and backwards compatibility has been something of an upcoming problem where I'm at as we've been getting better at CI/CD. Do you have any recommended resources for different approaches to it, or even some keywords that'll help when searching for such resources?
This sounds a lot like CQRS
Whoa I love this idea. I usually refer to it as "transactional" (always growing, represents the daily activity, read and write heavy) vs "lookup" data (almost exclusively read, not changed often). But still store them in the same database.

What ends up happening is we end up separating the two more with "cache this one, not that one" rather than two different databases.

I will explore this idea on my next greenfield project, whenever that happens.

Ran a similar setup to this and it worked pretty well!
This sounds interesting! Is there a more detailed write-up that you could link me to? Thanks!
I looked briefly (and I could have sworn I posted our "nine rules" on HN years ago, but I couldn't find it in a quick search).

I'll look again later tonight more thoroughly to see if I've posted the mechanisms and restrictions publicly anywhere before. If I haven't, I'll try to dig it out of our old dev doc system and post them here, but I can't make any promises as the docs I recall are now over a decade old, so I'm not fully sure they exist any more. :)

The internal docs for this are not on any of our documentation systems that we've moved to zero-trust (as they're 12 years old and unchanged for 5+ years). I will probably be able to retrieve them when we're back in the offices; shoot me an email (in my profile) and I'll find a way to get something over to you with some significant delay.
Not the OP, but the way I handle this is to ensure that all migrations are backwards compatible - the current and new versions of the app/API/service must be able to run with the old and new database.

This requires a little discipline, but if you follow a few simple rules it's not really that arduous:

  - when adding a new column, it must have a default value set, or be nullable

  - don't drop any columns

  - don't rename any columns
Now, for those last 2, what I really mean is "don't do it in a single release" - if you want to make destructive changes, do it over the course of 2 releases.

  - release 1: remove dependencies on the column from the app/API/service

  - release 2: perform the database migration with destructive changes
It probably sounds more difficult than it actually is :) In reality, I don't make destructive changes that often though.
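
As a rough illustration of the rules above, here is a sketch of a check that a pipeline could run against a migration script before applying it. The function name, the `allow_destructive` switch, and the regexes are all made up for this example, and the matching is a crude heuristic rather than a real SQL parser.

```python
import re

# Heuristic patterns only; a real SQL parser would be more reliable.
DESTRUCTIVE = re.compile(r"\b(DROP\s+COLUMN|RENAME\s+COLUMN|DROP\s+TABLE)\b", re.IGNORECASE)
ADD_COLUMN = re.compile(r"\bADD\s+COLUMN\s+(.*)", re.IGNORECASE | re.DOTALL)


def check_migration(sql, allow_destructive=False):
    """Return a list of rule violations found in a migration script."""
    problems = []
    for statement in filter(None, (part.strip() for part in sql.split(";"))):
        # Drops and renames belong in the later "release 2" migration.
        if DESTRUCTIVE.search(statement) and not allow_destructive:
            problems.append(f"destructive change, defer to a later release: {statement}")
        # New columns must be nullable or carry a default value.
        added = ADD_COLUMN.search(statement)
        if added:
            definition = added.group(1).upper()
            if "NOT NULL" in definition and "DEFAULT" not in definition:
                problems.append(f"new column needs a default or must be nullable: {statement}")
    return problems


print(check_migration("ALTER TABLE users ADD COLUMN age INTEGER NOT NULL"))
print(check_migration("ALTER TABLE users DROP COLUMN legacy_flag", allow_destructive=True))
```

The idea is that release 1 runs with the default settings, and only the release 2 migration, once nothing depends on the column any more, is checked with `allow_destructive=True`.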
We always do it by not pushing breaking changes to the database. It’s extremely freeing. It does require some discipline to go back and cleanup things later, but not worrying about database “versions” is the way to go in my opinion.
Not gp but here's a possible answer: I usually require db migrations to have a "down" script as well but "down" is never applied automatically. I only auto-apply "up", and when a rollback is needed (which has been very infrequent in my case) I manually apply the "down" scripts using Flyway cli commands or by hand.
To add.

Same here. A forward-only approach works best for us too. If you need to clean up a mess, it is a new migration script. It's too complex to try and work backwards. What if multiple scripts were run? Then you have to roll back, say, script number 2 out of 5 and there were destructive operations. It becomes really hairy really quickly. So forward-only is the easiest to reason about.

Please do make sure that you have snapshots for restoring if you really mess up badly. I know it's not always feasible to do snapshots before every deploy, but having a daily snapshot can bring you a lot of comfort.

If you built your own migration tool (highly encourage it, it's not that hard to build a forward-only migration tool), then you can trigger selective snapshots/table dumps for only the tables that get changed, and only for specific operations (updating schema, dropping columns, dropping tables) before your migration scripts touch the db - that way you have a path to restore. You don't always need a full DB dump (say you have 500+ tables but are only changing 2, 1 of which is destructive, thus the backup is tiny and quick). It also helps if different data sets live in isolation to help manage this kind of admin.
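
For what it's worth, a forward-only runner with selective backups can be sketched in a few lines. This is only an illustration, assuming SQLite, a hypothetical `schema_migrations` bookkeeping table, and a made-up migration list in which each entry names the tables to copy before it runs. Copying into a `backup_v…` table stands in for a real table dump.

```python
import sqlite3

# Hypothetical migrations: (version, SQL, tables to back up before running it).
MIGRATIONS = [
    (1, "CREATE TABLE legacy_sessions (id INTEGER PRIMARY KEY, token TEXT)", []),
    (2, "INSERT INTO legacy_sessions (token) VALUES ('abc')", []),
    (3, "DROP TABLE legacy_sessions", ["legacy_sessions"]),  # destructive step
]


def migrate(conn, migrations):
    """Apply every migration newer than the recorded version; there is no 'down'."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)")
    current = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_migrations"
    ).fetchone()[0]
    applied = []
    for version, sql, backup_tables in sorted(migrations, key=lambda m: m[0]):
        if version <= current:
            continue  # already applied on an earlier run
        for table in backup_tables:
            # Copy only the tables this migration touches, not the whole database.
            conn.execute(f'CREATE TABLE "backup_v{version}_{table}" AS SELECT * FROM "{table}"')
        conn.execute(sql)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        conn.commit()
        applied.append(version)
    return applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn, MIGRATIONS)
second_run = migrate(conn, MIGRATIONS)  # nothing left to apply
print(first_run, second_run)
```

Cleaning up a mistake then means appending a new entry to the list, and the backup tables give you a path to restore the dropped data by hand.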

Do you ever release serious errors into prod?
The question is not IF, but WHEN.

So ideally you have some kind of monitoring that reports/shows how many services are alive (and where they live in a cluster), how many errors they generate etc. Then based on some thresholds you can take them out of circulation and let them cool down. If certain kinds of errors occur, or occur at a certain frequency, the system can notify a site reliability engineer (or equivalent) to check it out. Then they can decide if it should be permanently removed and log an internal support ticket and so forth for the developers or product teams.
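
To make the threshold idea concrete, here is a small sketch that tracks the error rate of one service instance over a sliding window of recent requests. The class name, window size, and both thresholds are invented for the example; real monitoring stacks expose this very differently.

```python
from collections import deque


class ServiceHealth:
    """Tracks recent request outcomes for one service instance."""

    def __init__(self, window=20, eject_rate=0.5, page_rate=0.8):
        self.results = deque(maxlen=window)  # True = success, False = error
        self.eject_rate = eject_rate  # above this, take the instance out of circulation
        self.page_rate = page_rate  # above this, also notify a human

    def record(self, ok):
        self.results.append(ok)

    def error_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def action(self):
        rate = self.error_rate()
        if rate >= self.page_rate:
            return "notify_sre"
        if rate >= self.eject_rate:
            return "cool_down"
        return "in_rotation"


health = ServiceHealth(window=10)
for _ in range(10):
    health.record(True)
print(health.action())
for _ in range(5):
    health.record(False)
print(health.action())
```

Because the window only keeps the most recent results, an instance that recovers drifts back to "in_rotation" on its own once enough successful requests have pushed the errors out.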

Production issues are a part of life. You need to have some visibility on issues and their severity. Every company and tech stack is different, also depending on their SLAs and uptime promises.

Ads not rendering in an app might be less severe than a pump failure at a fuel station, so they have different kinds of monitoring and reaction times to faults. Obviously things like hospitals, banks, airlines/aircraft manufacturers have way different requirements and infrastructure from say a system that manages all school libraries for a state/province.

There are too many products and approaches to mention here if you were looking for a list of those. I have one or two favorite approaches and a handful of tools for this kind of stuff, half of which are homemade, so not something you can google. But you can google the topic and see a few different approaches. "microservices monitoring java" or "microservices monitoring best practice" or something along those lines will get you on a path. Try to find 5 different approaches and reflect on what each one is missing or how they may help you, and then ponder what you would like to see from a reporting system with hundreds/thousands of services.

And then obviously the best lessons will come from production itself.

Good luck!

> Production issues are a part of life.

Only if you accept them. The alternative is to do very few, rigorously tested releases per year. This way you don't have production issues. That's how industries like banking make sure bank transfers and card payments work and people's money is not randomly lost... It's a shame many other industries just accept their product failing for users as something normal/inevitable.

I can't say my experience echoes your comment. I'm a former employee of a financial services (billing) company built around a mainframe code base started in the 70s. We probably qualify as the sort of business you had in mind with your comment.

We did four releases a year, across the entire organization (so mainframe and more modern platforms), on Saturday nights/early Sunday mornings. There was plenty of testing but there were still plenty of errors only found on the day of, and rushed to fix in the wee hours or daylight hours of Sunday morning.

The only thing that seemed to correlate with release quality was the overall risk of the release, i.e. the complexity and number of new features written during that quarter.

> We did four releases a year, across the entire organization (so mainframe and more modern platforms), on Saturday nights/early Sunday mornings. There was plenty of testing but there were still plenty of errors only found on the day of, and rushed to fix in the wee hours or daylight hours of Sunday morning.

This way, you had bugs in prod for less than a day once every quarter, as opposed to having buggy prod all the time, as is common in organizations doing Continuous Deployment.

That's adorable. You know that no matter how much testing you do, something WILL slip through the cracks? Always.
Of course. Even the Space Shuttles blew up, twice. I'm guessing even pacemakers and software in nuclear power plants have bugs. The point is, these things are exceedingly rare or have very limited scope (occur only in the most obscure corner cases and also do limited damage), while in web companies which adopted Continuous Deployment, serious bugs are just common and I think seen as part of life.
Work in healthcare where we have heavily tested, quarterly releases. Well, we had a release today and some stuff was pretty horribly broken, despite being so heavily tested. We didn't adequately load test one piece of the new release under production-like conditions. Oops. Thankfully the fix was simple and a hotfix only took a couple of hours in total. Yet another lesson learned.
That's pretty bad, but nonetheless you detected and fixed it very quickly. Compare that to lingering bugs in Twitter iOS client (it's just broken on iPhone 5s, I guess they simply don't test on that device anymore), or happy random bugs in Windows 10 that appear after they CD an update on their users.
Then you get the worst of both worlds. You are in an industry where few very well tested releases are needed to meet SLA and customer expectations, but you have enough of the company looking at entirely different industries and wanting to follow their pipeline instead.
Sometimes, though thankfully less frequently (and for a less-disastrous definition of "serious") than I used to.

Luckily, a good CI/CD pipeline makes reversions just as easy as deployments. So even when you have errors, it's easier to correct than if you suddenly discovered "our deployment bash script / ansible playbook isn't as reversible as we thought it was"

Rarely. All features are gated by feature flags with the capability to dial up the feature gradually and dial down the launch instantly. I can monitor if the feature launch is going as expected by monitoring errors and metrics in the logs.
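
A minimal sketch of a flag that can be dialed up gradually and dialed down instantly, assuming users are bucketed by a hash of the flag name and a user id. The class and method names are hypothetical and not taken from any particular feature-flag product.

```python
import hashlib


class FeatureFlag:
    """Percentage rollout: a user is in or out based on a stable hash bucket."""

    def __init__(self, name, percent=0):
        self.name = name
        self.percent = percent  # share of users who see the feature, 0 to 100

    def dial(self, percent):
        # Clamp so a typo cannot push the rollout outside 0..100.
        self.percent = max(0, min(100, percent))

    def enabled_for(self, user_id):
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
        return bucket < self.percent


flag = FeatureFlag("new_checkout")
flag.dial(10)  # start with roughly 10% of users
early_users = [u for u in range(1000) if flag.enabled_for(u)]
flag.dial(0)  # instant dial-down if errors or metrics look wrong
print(len(early_users), any(flag.enabled_for(u) for u in range(1000)))
```

Hashing instead of picking users at random means the same users stay in the rollout as the percentage grows, which makes the errors and metrics in the logs easier to compare between steps.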
YES and this is why deployments to prod should go through many stages and have a long bake-in time for critical applications.

The idea of deploying every commit all the way to prod is very questionable.