Hacker News
by hot_gril 1233 days ago
A bit of a tangent, but my only real problem with Heroku involved a premium DB. Turns out that upgrading to premium enables high availability (HA) by default, and I don't even remember if you can disable it. HA replicates asynchronously to the standby master, so a master failover can cause a small amount of data loss. For my application, this was unacceptable, and I would have preferred unavailability instead (see CAP theorem). Today I have enough experience to check the fine print for that kind of detail, but anyway such a big change should come with big bold letters IMO.
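
To make the failure mode above concrete, here is a minimal toy sketch of asynchronous replication. It is not Heroku's actual implementation; the `Node` and `AsyncPair` classes and their methods are hypothetical names invented for illustration. The point is only that a write is acknowledged before the standby has it, so a failover drops whatever the standby never received.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy database node holding an append-only log of writes."""
    log: list = field(default_factory=list)


class AsyncPair:
    """Primary plus standby with asynchronous replication (hypothetical model)."""

    def __init__(self):
        self.primary = Node()
        self.standby = Node()

    def write(self, record):
        # The write is acknowledged as soon as the primary has it;
        # the standby is not consulted.
        self.primary.log.append(record)
        return "ack"

    def replicate(self, max_records=None):
        # Ship the records the standby is missing, possibly only some of them.
        missing = self.primary.log[len(self.standby.log):]
        if max_records is not None:
            missing = missing[:max_records]
        self.standby.log.extend(missing)

    def failover(self):
        # Promote the standby; anything it never received is lost.
        lost = self.primary.log[len(self.standby.log):]
        self.primary = self.standby
        self.standby = Node()
        return lost


pair = AsyncPair()
for i in range(5):
    pair.write(f"row-{i}")
pair.replicate(max_records=3)   # standby lags two records behind
lost_rows = pair.failover()
print(lost_rows)                # ['row-3', 'row-4']
```

All five writes were acknowledged to the client, yet two of them vanish after the failover. A synchronous setup would instead hold the acknowledgement until the standby confirms, trading availability and latency for durability.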

[Edit: To this day I'm still puzzled by what I'm about to describe, so idk if it's Heroku's fault or mine.] I got a call from my colleague one day saying our database had gone back in time. Evidently we lost an hour of records. The code wasn't even capable of deleting rows, and nobody had direct DB access but me, so after leafing through the docs I suspected a failover event caused it. Premium DBs also let you roll back the DB to a previous point in time, and we were able to recover most of our data this way, like Back to the Future. If this really was a failover event, it's super weird that the backup was more up to date than the standby master, and that a whole hour (rather than a minute) was lost.

3 comments

Having an HA follower is the only difference between Premium and Standard tiers, so I'm not really sure what else you expected them to do in this case. Like, premium-6 is 2x the cost of the standard-6 plan explicitly because of the HA follower.
Yes. I was inexperienced, saw "high availability," and didn't realize the standby could fall behind and lose data.
Also, it's worded like a strict improvement when really it's a tradeoff. You're sacrificing the guarantee of persistence for more availability. I feel like most people who know what this means are not going to want it.
Yes, AWS is similar with their DB offerings. You can discourage it from doing any updates/reboots (which cause a failover), but ultimately if they want to fail over, they can at any time.
Ouch. If it knows it's about to fail over from an update, it really should get the follower totally in sync with the leader first.
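
A rough sketch of what "get the follower in sync first" could look like for a planned failover. This is a toy model, not how AWS or Heroku does it; `planned_switchover` is a hypothetical function and plain lists stand in for replication logs.

```python
def planned_switchover(leader_log, follower_log):
    """Hand over leadership without losing acknowledged writes (toy model).

    Both arguments are lists standing in for replication logs.
    """
    # Step 1: the leader stops accepting writes (modelled by freezing a snapshot).
    frozen = list(leader_log)
    # Step 2: ship everything the follower is missing.
    follower_log.extend(frozen[len(follower_log):])
    # Step 3: promote only once the logs are identical.
    if follower_log != frozen:
        raise RuntimeError("follower not caught up; refusing to promote")
    return follower_log


leader = ["a", "b", "c", "d"]
follower = ["a", "b"]
new_leader = planned_switchover(leader, follower)
```

The cost is a short write pause while the follower catches up, which is acceptable for a scheduled update but not available when the leader dies unexpectedly.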
HA on RDS uses synchronous replication - you won’t lose data on automated failover under any normal circumstances.
Ok that's fine
I wonder what architecture they use that can lose an hour of data?

Most architectures I see might lose a few milliseconds of writes in the typical case, and perhaps a second of writes in the worst case (which occurs when the master gets islanded with a couple of clients).

If it was really the HA causing this, maybe the follower had a temp outage before the leader and hadn't yet caught up.

I already don't want HA if there's a chance for even 1 second of data loss, but for those who can tolerate that, there really should be an upper bound on the staleness. If your leader fails, the follower shouldn't take over unless it knows it's close to up-to-date.
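
The staleness bound could be expressed as a simple promotion guard. This is a minimal sketch under assumptions: `may_promote` and its parameters are hypothetical, and the lag is assumed to be whatever the follower last measured before losing the leader.

```python
def may_promote(follower_lag_seconds, max_staleness_seconds):
    """Return True only if the follower's known lag is within the bound.

    follower_lag_seconds is None when the lag is unknown (for example, the
    follower lost contact with the leader before reporting a position).
    """
    if follower_lag_seconds is None:
        return False  # unknown lag: prefer unavailability over silent data loss
    return follower_lag_seconds <= max_staleness_seconds


fresh = may_promote(0.05, max_staleness_seconds=1.0)     # True
stale = may_promote(3600.0, max_staleness_seconds=1.0)   # False
unknown = may_promote(None, max_staleness_seconds=1.0)   # False
```

Setting `max_staleness_seconds` to 0 gives the behavior I'd want: no promotion unless the follower is known to be fully caught up, and downtime otherwise.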