by mrkurt 1330 days ago
This didn't actually kill VMs, but it _did_ prevent them from being rescheduled for upwards of an hour. The vast majority of apps running on the platform had 100% uptime throughout the incident. The ones that didn't are the ones that rely on our rescheduling infrastructure to recover from app errors.
1 comments

Except my app isn't down due to an app error but due to a failed host in EWR, which I couldn't escape from because of the concurrent scheduling issues: https://status.flyio.net/incidents/v2dshzvy1mcl

EDIT: I recognize that these may be poorly timed but unrelated incidents, but it has been frustrating to be trapped on a broken box for 12 hours and have the status page telling me it's just new deploys that are borked :)

I don't want to belabor this, because we need to do a much better job of making it obvious, but single-node development Postgres databases are going to have downtime in our infrastructure. We'll get that host back for you, but you should _definitely_ add a replica if you care about availability.