by yatish27 845 days ago
"While building a feature, we performed a database migration command locally, but it incorrectly pointed to the production environment instead, which dropped all tables in production."

This was scary.

6 comments

This implies that their production environment is mutable. As in, a command can run and change the production environment. That’s a no no.

But I give them a pass because they are a young company. My company was similarly reckless early on, but as we scaled, we had to tighten things up, and turning to an immutable deployment approach has saved our asses so many times.

Is being a young company really an excuse when any half-decent engineer knows these things are bad?

Being a young company doesn't mean you ignore all the mistakes other people have made and figure them out for yourself.

I'm really surprised someone has access to the prod DB, and that it's possible for them to connect to it in dev (meaning they have a copy of the credentials?).

Knowing it's bad and punting on it for later are both things that can be possible at the same time.

How does immutable work in regard to databases? As in „we need to add a column“?
Someone writes the migration, commits it, it passes the build and unit test stages of the pipeline, then the application as currently running passes all function and integration tests with (and this is important) both the prior and the revised schema. Your commit is tagged as release ready! Not long after, the automation tooling confidently executes the now-tested migration under machine control during the next deploy, everyone goes home happy with your shiny new published_at column, and no-one has directly touched prod.

Two days later the CTO sends everyone a stroppy email about "column bloat that should've been a table", ssh's into the personal instance that they've been keeping alive† since before you had funding and learned to launch servers as immutable black boxes, and whilst trying to prove a point by rolling it back manually, drops all tables by mistake when a cat treads on the keyboard.

--

† excuse: "it's for reporting"
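
The "both the prior and the revised schema" step can be sketched roughly as below. This is only an illustration under stated assumptions: it uses an in-memory SQLite database, and the `posts` table, the `ALTER TABLE` migration, and the helper names `current_app_roundtrip` and `check_both_schemas` are all hypothetical, not anything from the pipeline described above.

```python
import sqlite3

# Hypothetical schema and migration, invented for illustration.
PRIOR_SCHEMA = "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
MIGRATION = "ALTER TABLE posts ADD COLUMN published_at TEXT"


def current_app_roundtrip(conn):
    """Stand-in for the application as currently deployed.

    It names its columns explicitly, so an extra column does not break it.
    """
    conn.execute("INSERT INTO posts (title) VALUES (?)", ("hello",))
    return conn.execute("SELECT id, title FROM posts").fetchall()


def check_both_schemas():
    """Run the current app code against the prior and the revised schema."""
    results = {}
    for name, statements in (
        ("prior", [PRIOR_SCHEMA]),
        ("revised", [PRIOR_SCHEMA, MIGRATION]),
    ):
        conn = sqlite3.connect(":memory:")
        for statement in statements:
            conn.execute(statement)
        results[name] = current_app_roundtrip(conn)
        conn.close()
    return results


if __name__ == "__main__":
    print(check_both_schemas())
```

If the currently running code behaves the same on both schemas, the migration can ship ahead of the code that uses the new column, which is what makes the automated deploy safe to run unattended.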

> Someone writes the migration, commits it, it passes the build and unit test stages of the pipeline, then the application as currently running passes all function and integration tests with (and this is important) both the prior and the revised schema. Your commit is tagged as release ready! Not long after, the automation tooling confidently executes the now-tested migration under machine control during the next deploy, everyone goes home happy

What happens if something goes really wrong after the production deploy? Is there a way to skip steps if you need to quickly push an emergency fix?

At our company, we have "an immutable DB", too, but when there's a critical emergency (say, full downtime), we can apply fixes manually. In that case, we run the tests after applying the fix.

Rookie move having the cat on the desk while ssh'd into prod...

That single sentence contains multitudes:

* Production should be immutable

* No one doing dev in a dev environment should have such trivial access to prod

* Are there still good reasons for a migration to drop all tables? I guess it's for the dev environment to etch-a-sketch to a known state?

Yikes.
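
One cheap last line of defense for the second bullet is a guard that refuses destructive commands when the connection string points at a known production host. A minimal sketch, assuming a hypothetical host name `db.prod.example.com` and hypothetical helpers `assert_not_production` and `reset_database`; it does not replace keeping prod credentials off dev machines in the first place.

```python
from urllib.parse import urlparse

# Hypothetical host names; a real setup would load these from configuration.
PRODUCTION_HOSTS = {"db.prod.example.com"}


class ProductionTargetError(RuntimeError):
    """Raised when a destructive command targets a production host."""


def assert_not_production(database_url):
    """Return the target host, or raise if it is a known production host."""
    host = urlparse(database_url).hostname
    if host in PRODUCTION_HOSTS:
        raise ProductionTargetError(
            f"refusing destructive command against {host}"
        )
    return host


def reset_database(database_url, drop_all_tables):
    """Run the destructive callback only after the guard has passed."""
    assert_not_production(database_url)
    drop_all_tables()


if __name__ == "__main__":
    reset_database(
        "postgres://dev:pw@localhost:5432/app",
        lambda: print("dev database reset"),
    )
```

The guard only helps if every destructive code path goes through it, so it belongs next to the etch-a-sketch reset command rather than in each developer's head.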

> No one doing dev in a dev environment should have such trivial access to prod

It’s the new and hip ‘cloud’! Probably using PlanetScale or something like that, which (last I checked; maybe it has changed, but it wasn’t on) doesn’t even have IP protections outside the MySQL user settings (which, while bad, would’ve protected them).

> Are there still good reasons for a migration to drop all tables?

We haven’t found any.

I'd suggest reading up on what some of these new database providers are doing to help prevent or fix mistakes like this. Since you mentioned PlanetScale, I'll use them as an example.

1) PlanetScale has IP ACLs, which lock passwords down to specific IP addresses. [1] Additionally, with Tailscale or another VPN solution, locking down based on IP isn't necessarily foolproof.

2) They also have Safe Migrations. When enabled, it prevents DDL from being run directly on a database. [2] Additionally, using deploy requests for zero-downtime schema migrations also allows you to use reverts, which will revert the migration. [3]

[1] https://planetscale.com/blog/introducing-ip-restrictions

[2] https://planetscale.com/docs/concepts/safe-migrations

[3] https://planetscale.com/blog/behind-the-scenes-how-schema-re...
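
To illustrate the general idea of blocking direct DDL on a protected connection, here is a rough analogue using SQLite's authorizer hook from Python's standard library. This is not how PlanetScale implements Safe Migrations; the `users` table and the helpers `deny_drops`, `open_protected`, and `try_drop` are hypothetical names for this sketch.

```python
import sqlite3


def deny_drops(action, arg1, arg2, db_name, trigger_or_view):
    """Authorizer callback: reject DROP TABLE, allow everything else."""
    if action == sqlite3.SQLITE_DROP_TABLE:
        return sqlite3.SQLITE_DENY
    return sqlite3.SQLITE_OK


def open_protected(path=":memory:"):
    """Open a connection on which DROP TABLE is refused."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY)")
    conn.set_authorizer(deny_drops)
    return conn


def try_drop(conn):
    """Attempt the destructive statement and report what happened."""
    try:
        conn.execute("DROP TABLE users")
    except sqlite3.Error as exc:
        return f"blocked: {exc}"
    return "dropped"


if __name__ == "__main__":
    print(try_drop(open_protected()))
```

The point is the same as in 2) above: the refusal lives in the database layer, so a mistyped command fails instead of relying on the operator noticing the wrong target.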

PlanetScale has Safe Migrations, which you can enable for your production DB (branch). Wondering though whether this will protect against everything mentioned here.

https://planetscale.com/docs/concepts/safe-migrations

Safe Migrations would prevent this completely. PlanetScale also allows you to restore multiple backups in parallel.

really? i've definitely done it before on my local as a quicker alternative to cleaning up the docker container/volume, doesn't seem that bad

ofc i'd think differently if i was also putting write-permission prod credentials into my machine, but luckily i haven't been in many places doing that

In addition, every centralized storage system and every cloud provider lets you take a snapshot of a database's disks.

We can restore a 10TB disk in about 12 minutes. It's much faster to snapshot, do the migration, then, if necessary, drop the disk and remake it from the snapshot (and then replay any other WAL changes up to the exact second you want with a tool like Barman, WAL-E, pgBackRest, etc.).

PostgreSQL backups are critical for disaster recovery, but the restores are so very, very slow that they should be a last resort.
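
The snapshot, migrate, restore-on-failure flow can be sketched at toy scale as follows. This is an assumption-heavy stand-in: a file copy of a SQLite database plays the role of the disk snapshot, there is no WAL replay step, and `migrate_with_snapshot` and `demo` are hypothetical names.

```python
import os
import shutil
import sqlite3
import tempfile


def migrate_with_snapshot(db_path, migration_sql):
    """Copy the database file, run the migration, restore the copy on failure."""
    snapshot_path = db_path + ".snapshot"
    shutil.copyfile(db_path, snapshot_path)  # stand-in for a disk snapshot
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(migration_sql)
    except sqlite3.Error:
        conn.close()
        shutil.copyfile(snapshot_path, db_path)  # remake from the snapshot
        return "restored"
    conn.close()
    return "migrated"


def demo():
    """Run a migration that destroys data and then fails halfway through."""
    workdir = tempfile.mkdtemp()
    db_path = os.path.join(workdir, "app.db")
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('ada')")
    conn.commit()
    conn.close()
    # The first statement succeeds and drops the table; the second fails.
    bad = "DROP TABLE users; ALTER TABLE missing ADD COLUMN x TEXT;"
    outcome = migrate_with_snapshot(db_path, bad)
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT name FROM users").fetchall()
    conn.close()
    shutil.rmtree(workdir)
    return outcome, rows


if __name__ == "__main__":
    print(demo())
```

Writes that land between the snapshot and the restore are lost in this toy version, which is the gap the WAL replay tools mentioned above are meant to close.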

We have been working on bytebase (https://github.com/bytebase/bytebase) for 3+ years to address this. It offers a change review workflow and environment propagation, and tries not to disturb the dev flow where possible.

Extremely so in my opinion. In this day and age, seeing something like this means this company fundamentally doesn't understand security. It should simply be impossible to "incorrectly point to the production environment", because whoever ran this shouldn't even have access to those credentials in the first place.

Separation between office and production should be the norm.

At a company with a modern, mature process this simply should not be possible.

I did this once early in my career. The ice-cold feeling in the pit of my stomach permanently etched this lesson into my soul lol.

I did too, but it was 25 years ago and quite honestly there was a different attitude about developers accessing production.

Today if a developer can bring down the operation accidentally, that’s a problem with the org more than the developer.

(On the other hand if a developer screws up the shared dev environment, it is his or her fault and they deserve the wrath of their coworkers.)